Creating a bulletproof workflow

*Posted February 28, 2017, by Marko Marinkovic*


Many bioinformaticians will have experienced the frustration of inexplicably failed analyses, incorrect outputs, or inconsistent results. When dealing with complex tools and gigabytes or even terabytes of data, it is particularly important to minimize the potential for human error in your analyses. For this post, we've interviewed some of our experienced bioinformaticians to collect their tips and best practices for creating "bulletproof" workflows on the CGC.

The CGC brings the power of cloud infrastructure to your bioinformatics analyses. You can add your own tools to the CGC and build them into workflows in combination with the publicly available tools. A workflow is a chain that is only as strong as its weakest link: to create a stable and consistent workflow, care needs to be taken when wrapping user-uploaded tools and when piping tools together into a single functional unit.

**Know thy tool(s)**. Each workflow consists of a series of tools with their own properties and behaviors. Before wrapping a tool for use on the CGC, you should know how the tool works: what it expects as inputs, which parameters it exposes, whether its behavior and performance depend on the types and sizes of its inputs, and any other details that might affect the way it functions alongside other tools on the CGC. We recommend testing the tool locally first, and only wrapping it for use on the CGC once you have confirmed that it meets your requirements and are familiar with its inputs and outputs.

It's also important at this stage to assess the tool's hardware requirements (CPU, memory, and disk space), at least for the input file size, type, and parameters of your intended use case. This information is important for selecting an adequate instance type on the CGC. To read more on choosing instance types, see the post [Making efficient use of compute resources](blog:making-efficient-use-of-compute-resources).

**Wrap and test tools separately**.
In the context of the CGC, "wrapping" a tool means specifying its input and output ports, its command options, and the CPU and memory resources it requires, so that the tool can be used on the CGC. This specification is made using the Tool Editor, which generates a Common Workflow Language (CWL) description of the tool that the CGC reads in order to run it. Before including a tool in a workflow, make sure it is wrapped properly by running it on its own on the CGC. To do this, you'll have to supply all of the tool's input files, including any that would otherwise be generated by an upstream tool in a workflow. When running the tool on its own, we recommend testing it with several sets of input files of different sizes and characteristics. Then, once you are confident the tool works the way it is intended to, you can start connecting it to other tools.

Here are some more things to pay attention to when wrapping a tool:

* If a tool requires a [secondary file](page:secondary-files), such as an index file, to be supplied along with the "primary" file on an input port, make sure that the input port is configured to find and read the secondary file.

* Specify correct CPU and memory requirements for your tool. The CGC scheduling algorithm uses the values supplied in these fields in the Tool Editor to select a suitable computation instance to run the tool on. If the values you enter are lower than the tool requires for your specified inputs, the tool will fail; if they are higher than necessary, you may end up paying for compute resources you are not using.

* Find edge cases that cause the tool to fail (if any). To do this, use input files that differ in size and other specific characteristics, as well as different values of input parameters.
* If there is a parameter that will cause the tool to misbehave unless it is given a specific value, define that default value and ["lock" it](doc:the-pipeline-editor#section-expose-tool-parameters-for-configuration) so it cannot be changed on execution, and explain in the tool description why the parameter is locked. Locking is also good practice for "technical" parameters that do not directly affect the results the tool produces.

**Assemble the workflow section by section**. A step-by-step approach makes debugging much easier. It is always a good idea to build smaller "pieces" of a workflow that each include just a few tools; this way you can see how the tools operate together and whether there are any errors or unexpected behaviors that need to be addressed. Again, test each sub-workflow with test input files and check that it produces correct, expected outputs. If so, it is ready to be connected to the other pieces of the workflow.

**Connect all the pieces into a single workflow**. When putting all the components together, pay attention to the following features:

* **Stage input**. This option in the Tool Editor makes a tool's output files available in the working directory of the next downstream tool in the workflow. Learn more about stage input and its [common use cases](doc:tool-input-ports#section-stage-input).

* **Metadata inheritance**. Several tools require their input files to be annotated with appropriate metadata values, so it is important to set up your tools so that metadata is inherited from input files by output files. Learn more about configuring tools' handling of metadata in the documentation on [tool output ports](doc:tool-output-ports).

* **Secondary files**. These files are usually index files that allow faster random access to a file containing genomic data.
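As a rough sketch, this is how a CWL input port can declare that an index must travel with its primary file; the port name and the BAM/BAI pairing below are illustrative, not taken from a specific CGC tool:

```yaml
# Illustrative CWL v1.0 input port fragment. The ".bai" pattern tells
# the runner to locate and stage the BAM index alongside each primary
# BAM file, so the tool finds both in its working directory.
inputs:
  input_bam:
    type: File
    secondaryFiles:
      - .bai            # stage <name>.bam.bai next to <name>.bam
    inputBinding:
      prefix: --input
```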
Some tools require an index file to be present along with the data file; to ensure that index files are always available, we recommend including a suitable indexing tool upstream of the tool that needs them. This might slightly extend the execution time of the workflow, but it will prevent the workflow from failing when the required secondary file is not present in the project.

After thorough testing and incremental assembly, your workflow should be bulletproof. Good luck!

Don't forget that our interdisciplinary support team is on hand at all times to help you debug your analyses. If you'd like a helping hand or a second opinion, just click the **Get support** button on the CGC.

### What next?

To read more about advanced techniques that can be used on the CGC, see our series 'Tool wrapping tips and tricks' in the [CGC documentation](http://docs.cancergenomicscloud.org/docs/).
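As a concrete reference for the wrapping tips above, here is a minimal, hypothetical CWL tool description of the kind the Tool Editor generates. Every name and value below (the executable, ports, and resource figures) is illustrative only, not a real CGC tool:

```yaml
# Minimal, hypothetical CWL v1.0 CommandLineTool description.
cwlVersion: v1.0
class: CommandLineTool
baseCommand: [my_tool]       # hypothetical executable
requirements:
  ResourceRequirement:
    coresMin: 4              # CPU requirement the scheduler uses to pick an instance
    ramMin: 8000             # memory requirement, in MiB
inputs:
  reads:
    type: File
    inputBinding:
      prefix: --reads
  threads:
    type: int
    default: 4               # a "technical" parameter you might lock in the Tool Editor
    inputBinding:
      prefix: --threads
outputs:
  result:
    type: File
    outputBinding:
      glob: "*.out"          # collect the tool's output file from the working directory
```

Getting the `ResourceRequirement` values right matters in both directions, as discussed above: too low and the tool fails, too high and you pay for idle capacity.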