QCFasta restart from an incomplete/updated run with a large number of inputs is very slow

Issue #59 new
Yilin Li created an issue

When rerunning the QC workflow with QCFasta after a crash/error or after adding new samples, QCFasta takes days (around 3-4 days with almost 1000 fastq files) before it starts sending new tasks, most likely because of the high number of components involved. Based on the logs, it seems to take minutes to check the progress of a single component in the starting phase, and the overall memory usage of the master seems very high (20+ GB), so the issue may be related to I/O.

Comments (2)

  1. Ville Rantanen

    This is a legitimate problem. Running test case6, which has a single fq file, creates over 100 instances and 10 callbacks (before it fails on me...).

    There may be a point in removing some of the parallelization, since its benefit is lost when you have so many samples. You can of course just split your data into smaller sets and run several pipelines, but I understand that's not very convenient.

    I've also tackled the "too many instances" problem by splitting my pipeline into two consecutive parts, and using the outputs of the first as inputs of the second (using PipelineInput).
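
As a rough illustration of the "split your data into smaller sets" workaround mentioned above, here is a minimal plain-Scala sketch. It is not Anduril code: the `BatchSamples` object, the `batches` helper, the sample names, and the batch size of 250 are all hypothetical and only show how a long sample list could be partitioned before each part is fed to its own pipeline run.

```scala
// Hypothetical helper, not part of Anduril or QCFasta:
// partitions a list of sample names into fixed-size batches.
object BatchSamples {
  // Returns consecutive batches of at most `batchSize` samples each.
  def batches(samples: Seq[String], batchSize: Int): Seq[Seq[String]] = {
    require(batchSize > 0, "batchSize must be positive")
    samples.grouped(batchSize).toSeq
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample names standing in for ~1000 fastq inputs.
    val samples = (1 to 1000).map(i => f"sample$i%04d")
    val parts = batches(samples, 250)
    parts.zipWithIndex.foreach { case (batch, i) =>
      println(s"batch $i: ${batch.size} samples, first=${batch.head}")
    }
  }
}
```

Each batch could then be written to its own input list and run as a separate pipeline, at the cost of the inconvenience noted above.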

  2. Ville Rantanen

    For instance, rows 176-180:

        for ( (k,v) <- iterArray(reads) )
        {
          statusString(k) = status(k).textRead()
          statusCSV(k) = StringInput(content="Sample\tStatus\n"+k+"\t"+statusString(k))
        }

    will read the status file in the for loop every time you run it, whether or not there are changes. While port.textRead() is a nice feature, it does not scale well.

    Replacing it with QuickBash and passing the status as an input to it would be faster.
