WGS Upscaling - IT & Bioinformatics Evaluation

Data transfer, Data storage, Bioinformatics pipeline capacity
Author
Affiliation

GDx OUSAMG

Published

February 5, 2024

1 Background


GDx at OUSAMG is planning to scale up WGS production to 4 x 48 samples or 2 x 48 + 1 x 96 samples per week (192 samples per week in either case). Do we have enough capacity in IT and the bioinformatics pipelines for this upscaling?

The capacity of IT and the bioinformatics pipelines can be evaluated from the following three aspects:

  1. Data transfer speed
  2. Data storage
  3. Pipeline capacity

2 Data transfer speed


Both sequencing data and NSC pipeline results are stored at the NSC, so a large volume of data must be transferred from NSC to TSD. The transfer is handled by the nsc-exporter, which uses the TSD s3api, which in turn uses s3cmd under the hood. Every 10 minutes the nsc-exporter checks for new data and transfers any it finds with s3cmd put.


flowchart LR
    subgraph dt[nsc-exporter]
    check-new-data(((New data?))) --> |Yes| transfer(Transfer to TSD) 
    transfer --> sleep(Sleep 10 minutes)
    sleep --> check-new-data
    check-new-data --> |No| sleep
    end
    subgraph dp[data producer]
    lims-exporter[lims-exporter]  ---> |produce| check-new-data
    pipeline[NSC pipeline] ---> |produce| check-new-data
    end
    style dt fill:#e4eda6,stroke-width:3px
    style dp fill:#aab0a2
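
The loop in the flowchart can be sketched as follows. This is a minimal illustration, not the actual nsc-exporter code: the function names, the injectable `find_new_files`, `transfer` and `sleep` callables, and the `max_cycles` parameter are all assumptions introduced here so that the loop can be exercised without touching TSD. Only the 10-minute interval and the use of `s3cmd put` come from the text above.

```python
import subprocess
import time
from pathlib import Path
from typing import Callable, Iterable, Optional

POLL_INTERVAL_SECONDS = 600  # "every 10 minutes", as stated in the text


def s3cmd_put(path: Path, destination: str) -> None:
    """Upload one file with `s3cmd put`.

    `destination` is a hypothetical S3 URI; the real bucket layout is not
    described in this report. Raises CalledProcessError if s3cmd fails.
    """
    subprocess.run(["s3cmd", "put", str(path), destination], check=True)


def run_exporter(
    find_new_files: Callable[[], Iterable],
    transfer: Callable[[object], None],
    sleep: Callable[[float], None] = time.sleep,
    interval: float = POLL_INTERVAL_SECONDS,
    max_cycles: Optional[int] = None,
) -> int:
    """Poll for new data, transfer it, sleep, and repeat.

    Returns the number of files transferred. With max_cycles=None the loop
    runs forever, as a long-lived exporter would.
    """
    transferred = 0
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        # "New data?" -> Yes: transfer each file; No: fall through to sleep.
        for item in find_new_files():
            transfer(item)
            transferred += 1
        sleep(interval)
        cycles += 1
    return transferred
```

A real exporter also has to remember which files were already sent, for example by having `find_new_files` consult a record of completed transfers. That bookkeeping is left out of the sketch.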


2.1 Data Collection

To evaluate the data transfer speed from NSC to TSD, we collected the historical data transfer records between 2023-09-01 08:41:40 and 2023-11-30 20:14:10 from the nsc-exporter log.

Example records:

datetime  2023-10-01 21:46:38
project   wgs316
filename  Diag-wgs316-HG21989353-DR.bamout.mt.bam
bytes     8754041
seconds   0.4
speed     23170000

datetime  2023-11-17 16:45:32
project   wgs332
filename  96775424702-169170FM-TrioFar-KIT-wgs_S23_R2_001.fastq.gz
bytes     45163992771
seconds   614.4
speed     70100000

datetime  2023-11-26 13:45:51
project   wgs336
filename  Diag-wgs336-HG79457619-DR.final.vcf
bytes     1208093827
seconds   13.4
speed     86080000

datetime  2023-11-01 19:18:04
project   wgs326
filename  Diag-wgs326-04357832702.sample
bytes     1551
seconds   0
speed     38470

datetime  2023-11-23 02:54:26
project   wgs334
filename  231120_A00943_0791_BHMVLCDSX7.31637902802-161030FM-TrioFar-KIT-wgs_S11_R2_001.qc.pdf
bytes     117005
seconds   0
speed     3010000

The nsc-exporter log files and the sequencer overview HTML files were ignored for simplicity.¹
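
As a sanity check on records like these, the transfer rate can be recomputed from the `bytes` and `seconds` fields. The sketch below is illustrative only; the function name is an assumption. Because `seconds` appears to be logged to one decimal place, the recomputed rate can differ from the logged `speed` value, and records with `seconds` equal to 0 give no usable rate at all.

```python
from typing import Optional

MIB = 1024 ** 2  # bytes per MiB


def throughput_bytes_per_second(size_bytes: int, seconds: float) -> Optional[float]:
    """Return bytes per second, or None when the logged duration is zero."""
    if seconds <= 0:
        # Very small files are logged with a duration of 0 seconds,
        # so no rate can be derived from them.
        return None
    return size_bytes / seconds


# The fastq.gz example record: 45163992771 bytes in 614.4 seconds.
rate = throughput_bytes_per_second(45163992771, 614.4)
print(f"{rate / MIB:.1f} MiB/s")

# The .sample example record has seconds == 0, so no rate is available.
print(throughput_bytes_per_second(1551, 0))
```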

2.2 Data Overview

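Transferred file sizes range from 0.0 B to 100.9 GiB, with a mean of 1.5 GiB, a median of 9.3 KiB, and a standard deviation of 8.1 GiB. The mean is far larger than both the median and the third quartile, so the distribution is heavily right-skewed: most files are small, and a minority of large files accounts for most of the data volume.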

         filesize
Min.        0.0 B
1st Qu.   421.0 B
Median    9.3 KiB
Mean      1.5 GiB
3rd Qu. 968.0 KiB
Max.    100.9 GiB
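
The sizes in this table use binary units (KiB, MiB, GiB) with one decimal place. A small helper along the following lines reproduces that formatting from raw byte counts. It is a sketch written for this illustration, not the code used to generate the report.

```python
def human_bytes(n: float) -> str:
    """Format a byte count with binary prefixes and one decimal place."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"]
    value = float(n)
    for unit in units[:-1]:
        if abs(value) < 1024:
            return f"{value:.1f} {unit}"
        value /= 1024  # move on to the next larger unit
    return f"{value:.1f} {units[-1]}"


# Byte counts taken from the example records in Section 2.1.
for size in (1551, 117005, 8754041, 1208093827, 45163992771):
    print(size, "->", human_bytes(size))
```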

Transfer speeds range from 1.0 B/s to 93.1 MiB/s, with a mean of 12.2 MiB/s, a median of 286.7 KiB/s, and a standard deviation of 23.4 MiB/s. Small files usually transfer at very low speeds, so the average over all files is not a good indicator of the actual transfer speed; see Section 2.3.1.

        speed(/s)
Min.        1.0 B
1st Qu.  11.9 KiB
Median  286.7 KiB
Mean     12.2 MiB
3rd Qu.   8.4 MiB
Max.     93.1 MiB
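
To illustrate why the per-file average understates the speed that matters for bulk transfer, the sketch below compares an unweighted mean with a byte-weighted mean, using only the five example records from Section 2.1. The byte-weighted mean is one possible alternative introduced here for illustration; it is not a statistic used in this report. Speeds are kept in whatever unit the log reports.

```python
# (bytes, speed as logged) for the five example records in Section 2.1.
records = [
    (8754041, 23170000),
    (45163992771, 70100000),
    (1208093827, 86080000),
    (1551, 38470),
    (117005, 3010000),
]


def unweighted_mean_speed(rows):
    """Every file counts equally, however small it is."""
    return sum(speed for _, speed in rows) / len(rows)


def byte_weighted_mean_speed(rows):
    """Each file counts in proportion to the bytes it contributes."""
    total_bytes = sum(size for size, _ in rows)
    if total_bytes == 0:
        raise ValueError("no bytes transferred")
    return sum(size * speed for size, speed in rows) / total_bytes


print("unweighted   :", round(unweighted_mean_speed(records)))
print("byte-weighted:", round(byte_weighted_mean_speed(records)))
```

In this small sample the two tiny files pull the unweighted mean well below the speed at which almost all of the bytes actually moved.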

Transfer times range from 0 seconds to 2084.4 seconds, with a mean of 19.5 seconds, a median of 0 seconds, and a standard deviation of 104 seconds.

         seconds
Min.        0.00
1st Qu.     0.00
Median      0.00
Mean       19.47
3rd Qu.     0.10
Max.     2084.40
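
The six-number summaries and standard deviations in this section can be reproduced with a few lines of code. The sketch below uses Python's standard `statistics` module and is an assumption about how one might compute them, not the code behind the tables above. The `inclusive` quantile method interpolates linearly between order statistics, so quartiles from other tools may differ slightly depending on their quantile definition.

```python
import statistics


def summarize(values):
    """Return min, quartiles, mean, max and sample standard deviation."""
    data = sorted(values)
    if len(data) < 2:
        raise ValueError("need at least two observations")
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return {
        "Min.": data[0],
        "1st Qu.": q1,
        "Median": median,
        "Mean": statistics.fmean(data),
        "3rd Qu.": q3,
        "Max.": data[-1],
        "SD": statistics.stdev(data),
    }


# Transfer times (seconds) of the five example records in Section 2.1.
for name, value in summarize([0.4, 614.4, 13.4, 0, 0]).items():
    print(f"{name:<8}{value:10.2f}")
```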

2.3 Correlation

Next, we examine how file size affects transfer speed and transfer time.

2.3.1 Transfer speed and time vs. file size (all files)


Small files transfer at lower speeds. Large files (>2 GB) reach a good transfer speed of around 80 MB/s. However, the highest speeds are observed for files of around 200 MB (zoom in on the plot or see Figure 5).
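
One simple way to check this relationship numerically, without a plot, is to group records by the order of magnitude of their size and take the median speed per group. The sketch below does this for the five example records from Section 2.1 only. The bucketing scheme and function names are assumptions made for illustration, and five records are far too few to support any conclusion on their own.

```python
import statistics
from collections import defaultdict

# (bytes, speed as logged) for the five example records in Section 2.1.
records = [
    (8754041, 23170000),
    (45163992771, 70100000),
    (1208093827, 86080000),
    (1551, 38470),
    (117005, 3010000),
]


def size_bucket(size_bytes: int) -> int:
    """Decimal order of magnitude: 3 for 1000-9999 bytes, 6 for 1-9.99 MB, ...

    Counting digits avoids floating-point logarithms; a size of 0 lands in
    bucket 0 together with sizes of 1-9 bytes.
    """
    return len(str(int(size_bytes))) - 1


def median_speed_by_bucket(rows):
    """Map each size bucket to the median logged speed of its files."""
    groups = defaultdict(list)
    for size, speed in rows:
        groups[size_bucket(size)].append(speed)
    return {bucket: statistics.median(speeds) for bucket, speeds in sorted(groups.items())}


for bucket, speed in median_speed_by_bucket(records).items():
    print(f"10^{bucket} bytes: median speed {speed}")
```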