Parallelize a Bash FOR Loop

  • I have been trying to parallelize the following script, specifically each of the three FOR loop instances, using GNU Parallel, but haven't been able to. The four commands contained within the FOR loop run in series, and each iteration takes around 10 minutes.

    #!/bin/bash
    
    kar='KAR5'
    runList='run2 run3 run4'
    mkdir normFunc
    for run in $runList
    do 
      fsl5.0-flirt -in $kar"deformed.nii.gz" -ref normtemp.nii.gz -omat $run".norm1.mat" -bins 256 -cost corratio -searchrx -90 90 -searchry -90 90 -searchrz -90 90 -dof 12 
      fsl5.0-flirt -in $run".poststats.nii.gz" -ref $kar"deformed.nii.gz" -omat $run".norm2.mat" -bins 256 -cost corratio -searchrx -90 90 -searchry -90 90 -searchrz -90 90 -dof 12 
      fsl5.0-convert_xfm -concat $run".norm1.mat" -omat $run".norm.mat" $run".norm2.mat"
      fsl5.0-flirt -in $run".poststats.nii.gz" -ref normtemp.nii.gz -out $PWD/normFunc/$run".norm.nii.gz" -applyxfm -init $run".norm.mat" -interp trilinear
    
      rm -f *.mat
    done
    
  • goldilocks (accepted answer, 7 years ago)

    Why don't you just fork (aka. background) them?

    foo () {
        local run=$1  # the function's first argument, kept local so it doesn't leak out of the function
        fsl5.0-flirt -in $kar"deformed.nii.gz" -ref normtemp.nii.gz -omat $run".norm1.mat" -bins 256 -cost corratio -searchrx -90 90 -searchry -90 90 -searchrz -90 90 -dof 12 
        fsl5.0-flirt -in $run".poststats.nii.gz" -ref $kar"deformed.nii.gz" -omat $run".norm2.mat" -bins 256 -cost corratio -searchrx -90 90 -searchry -90 90 -searchrz -90 90 -dof 12 
        fsl5.0-convert_xfm -concat $run".norm1.mat" -omat $run".norm.mat" $run".norm2.mat"
        fsl5.0-flirt -in $run".poststats.nii.gz" -ref normtemp.nii.gz -out $PWD/normFunc/$run".norm.nii.gz" -applyxfm -init $run".norm.mat" -interp trilinear
    }
    
    for run in $runList; do foo "$run" & done
    

    In case that's not clear, the significant part is here:

    for run in $runList; do foo "$run" & done
                                       ^
    

    This causes the function to be executed in a forked shell, in the background. That's parallel.

    That worked like a charm. Thank you. Such a simple implementation (makes me feel so stupid now!).

    If I had 8 files to run in parallel but only 4 cores, could that be handled in this setup, or would it require a job scheduler?

    It doesn't really matter in this context; it's normal for the system to have more active processes than cores. If you have *many short tasks*, ideally you would feed a queue serviced by a number of worker threads smaller than the number of cores. I don't know how often that is really done with shell scripting (in which case they wouldn't be threads, they'd be independent processes), but with *relatively few long tasks* it would be pointless: the OS scheduler will take care of them.
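
    To make that queue/worker idea concrete, here is a minimal sketch (not from the thread) using xargs -P, which keeps at most N worker processes busy and hands each one item from stdin; process_one is a hypothetical stand-in for whatever a single task does:

    process_one () { echo "handling $1"; sleep 1; }   # hypothetical per-item task
    export -f process_one                             # make the function visible to the shells xargs spawns

    printf '%s\n' run1 run2 run3 run4 run5 run6 run7 run8 |
        xargs -n 1 -P 4 bash -c 'process_one "$1"' _  # at most 4 workers running at a time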

    You also might want to add a `wait` command at the end so the master script does not exit until all of the background jobs do.
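
    For example, a minimal version of the loop above with that addition (foo and $runList as defined in the answer):

    for run in $runList; do foo "$run" & done
    wait  # block here until every backgrounded foo has exited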

    I would also find it useful to limit the number of concurrent processes: my processes each use 100% of a core's time for about 25 minutes. This is on a shared server with 16 cores, where many people are running jobs. I need to run 23 copies of the script. If I run them all concurrently, I swamp the server and make it useless for everyone else for an hour or two (the load goes up to 30 and everything else slows way down). I guess it could be done with `nice`, but then I don't know if it would ever finish.
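
    One hedged sketch of such a cap in plain bash (assumes bash 4.3+ for `wait -n`; myjob and the count of 23 are stand-ins for the real script):

    maxjobs=8                                  # use at most half of the 16 cores
    for i in $(seq 1 23); do
        while (( $(jobs -rp | wc -l) >= maxjobs )); do
            wait -n                            # pause until any one running job finishes
        done
        myjob "$i" &                           # hypothetical: launch one copy of the script
    done
    wait                                       # let the last jobs drain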

    Ahh, PSkocik's answer below has a really simple solution.

    What does `local run=$1` mean here?

    Any solutions for loops with many iterations? I tried this approach for 150k iterations and it performs terribly. I understand why, but I don't know how to express "queue work items in a thread pool and run them in parallel" in bash.
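
    For what it's worth, a sketch of that work-queue idea with GNU parallel, which queues everything from stdin and keeps -j jobs running at once (assumes GNU parallel is installed and bash is the shell; do_item is a hypothetical command for one iteration):

    do_item () { echo "item $1"; }   # hypothetical per-iteration work
    export -f do_item                # exported bash functions are callable from parallel

    seq 1 150000 | parallel -j "$(nproc)" do_item {}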

Licensed under CC BY-SA with attribution

