{"title":"Multi-instance scheduling algorithm","slug":"multi-instance-scheduling-algorithm","user":{"name":"Kate Hodesdon"},"createdAt":"2016-05-10T11:59:05.397Z"}

Multi-instance scheduling algorithm


Jobs are subprocesses carried out in tool executions (tasks). Each job has its own CPU and memory requirements, and so has a particular class of suitable computation instances on which it can run. Information about each job's requirements is inherited from the tool description.

By default, the CGC schedules jobs so as to minimize the number of instances used, fitting as many jobs as possible onto each instance that has sufficient resources to execute them. A sketch of the scheduling algorithm is reproduced below.

You may override the inherited requirements for a job by specifying the instance type that you want it to run on, using one of the methods documented in [Set computation instances](http://docs.cancergenomicscloud.org/v1.0/docs/set-computation-instances).
[block:code]
{
  "codes": [
    {
      "code": "def schedule(instances, jobs):\n    # Order jobs by the cost of the instances they require.\n    unscheduled_jobs = prioritize_jobs(jobs)\n\n    while unscheduled_jobs:\n        # Pack jobs onto the instances that are already allocated.\n        unscheduled_jobs = fit_jobs(instances, unscheduled_jobs)\n        # Allocate an instance for the first job that still does not fit.\n        instances += allocate_new_instances(unscheduled_jobs)\n\n    release_empty_instances(instances)\n    return instances",
      "language": "python",
      "name": "Algorithm sketch. See below for more details of the functions."
    }
  ]
}
[/block]
In the algorithm above:
* `jobs` is a list of jobs to be scheduled. Each job in `jobs` has associated CPU and memory requirements.
* `instances` is a list of the instances allocated for the task, each with a record of the resources it still has available at that time.
[block:callout]
{
  "type": "warning",
  "body": "A job's requirements may not be known until the task starts, or even until it is running. For example, a task may run parallel jobs (one per instance) for each contig; in this case, the number of instances required depends on the input to the task, and that input may in turn depend on the output of a previous node in a workflow.",
  "title": ""
}
[/block]
* `prioritize_jobs` is a function that orders a list of jobs by the cost of the instances each one requires, given its required CPU and memory resources.

* `fit_jobs` is a function that loops over each job in `prioritize_jobs(jobs)` and, for each job, over each instance in `instances`. It aims to fit each job to the first suitable instance in `instances`. Because `instances` is ordered so that instances with fewer available resources come before those with more, `fit_jobs` packs jobs densely onto instances.

A job ‘fits’ on an instance if:
1. The instance has at least as much CPU and memory as the job's CPU and memory requirements specify.
2. The tool used in the job does not have a different instance type specified for it by any of the methods documented in [Set computation instances](http://docs.cancergenomicscloud.org/v1.0/docs/set-computation-instances). Tools for which you have specified an instance type will always be fitted to the chosen instance type.

* `allocate_new_instances` is a function that allocates a single instance per iteration of the `while` loop over `unscheduled_jobs`. It takes the first unallocated job from the list `prioritize_jobs(jobs)`, and allocates it one of the following instances:
 * the cheapest instance for which `instance_cpu >= job_cpu` and `instance_ram >= job_ram`, chosen from a list of instance types in `instances`.
 * the instance specified by the `InstanceType` hint, set using one of the methods documented in [Set computation instances](http://docs.cancergenomicscloud.org/v1.0/docs/set-computation-instances). If this instance doesn't satisfy `instance_cpu >= job_cpu` and `instance_ram >= job_ram`, an error is raised.
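To make the function descriptions above concrete, here is a minimal, runnable sketch of how the pieces could fit together. It is an illustration under stated assumptions, not the CGC's actual implementation: the `Job` and `Instance` classes, the `INSTANCE_TYPES` catalogue and its prices, the helper `pick_type`, the choice to prioritize the most expensive jobs first, and the sort key used for "fewer available resources" are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical catalogue: name -> (CPU cores, RAM in GB, hourly cost). Illustrative values only.
INSTANCE_TYPES = {"small": (2, 4, 0.10), "medium": (4, 16, 0.25), "large": (16, 64, 1.00)}

@dataclass
class Job:
    name: str
    cpu: int
    ram: int
    instance_type: Optional[str] = None  # stands in for the InstanceType hint

@dataclass
class Instance:
    type_name: str
    free_cpu: int
    free_ram: int
    jobs: List[str] = field(default_factory=list)

def pick_type(job):
    """Return the hinted type, or the cheapest type that satisfies the job."""
    names = [job.instance_type] if job.instance_type else list(INSTANCE_TYPES)
    ok = [n for n in names
          if INSTANCE_TYPES[n][0] >= job.cpu and INSTANCE_TYPES[n][1] >= job.ram]
    if not ok:
        raise ValueError(f"no suitable instance type for job {job.name!r}")
    return min(ok, key=lambda n: INSTANCE_TYPES[n][2])

def prioritize_jobs(jobs):
    # Assumption: jobs that need the most expensive instances are placed first.
    return sorted(jobs, key=lambda j: INSTANCE_TYPES[pick_type(j)][2], reverse=True)

def fits(job, inst):
    if job.instance_type and job.instance_type != inst.type_name:
        return False
    return inst.free_cpu >= job.cpu and inst.free_ram >= job.ram

def fit_jobs(instances, jobs):
    leftover = []
    for job in jobs:
        # Instances with fewer available resources are tried first (dense packing).
        instances.sort(key=lambda i: (i.free_cpu, i.free_ram))
        target = next((i for i in instances if fits(job, i)), None)
        if target is None:
            leftover.append(job)
            continue
        target.free_cpu -= job.cpu
        target.free_ram -= job.ram
        target.jobs.append(job.name)
    return leftover

def allocate_new_instances(unscheduled_jobs):
    # One new instance per call, sized for the first job that did not fit.
    if not unscheduled_jobs:
        return []
    name = pick_type(unscheduled_jobs[0])
    cpu, ram, _cost = INSTANCE_TYPES[name]
    return [Instance(name, cpu, ram)]

def release_empty_instances(instances):
    instances[:] = [i for i in instances if i.jobs]

def schedule(instances, jobs):
    unscheduled_jobs = prioritize_jobs(jobs)
    while unscheduled_jobs:
        unscheduled_jobs = fit_jobs(instances, unscheduled_jobs)
        instances += allocate_new_instances(unscheduled_jobs)
    release_empty_instances(instances)
    return instances

if __name__ == "__main__":
    demo = [Job("sort1", 1, 2), Job("align", 3, 8), Job("sort2", 1, 2), Job("index", 1, 2)]
    for inst in schedule([], demo):
        print(inst.type_name, inst.jobs)
```

In this sketch the loop always makes progress: each pass allocates an instance that is large enough for the first job still waiting, so that job is placed on the next pass. Smaller jobs then fill whatever capacity remains on instances that are already allocated before any further instance is requested.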
[block:callout]
{
  "type": "success",
  "body": "If you install your own tool on the CGC using Rabix, you can set its **required resources** in terms of CPU and memory. This is done in the tool editor. You can also set the required resources using [dynamic expressions](http://seven-bridges.readme.io/docs/dynamic-expressions-in-tool-descriptions): for instance, a tool's required memory may be set to twice its input size. \n\nSetting a tool's required resources means that it will always be allocated an instance that is sufficient for it to run. However, if you opt to override the default instance type by specifying a different one, using one of the methods documented in [Set computation instances](http://docs.cancergenomicscloud.org/v1.0/docs/set-computation-instances), you may end up providing your tool with insufficient CPU or memory.",
  "title": "Set a tool's required resources"
}
[/block]
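The arithmetic behind the callout above can be illustrated with a short sketch. It is written in Python purely for illustration: real dynamic expressions are entered in the tool editor, and the instance names, sizes, and function names below are hypothetical. It computes a memory requirement of twice the input size and then checks whether a manually chosen instance type would satisfy it.

```python
import math

# Hypothetical instance types: name -> (CPU cores, memory in MB). Illustrative values only.
INSTANCE_TYPES = {"small": (2, 4096), "medium": (4, 16384)}

def required_memory_mb(input_size_bytes, factor=2.0):
    """Memory requirement as a multiple of the input size, rounded up to whole MB."""
    return math.ceil(factor * input_size_bytes / (1024 * 1024))

def override_is_sufficient(instance_type, job_cpu, job_ram_mb):
    """Check instance_cpu >= job_cpu and instance_ram >= job_ram for an override."""
    cpu, ram_mb = INSTANCE_TYPES[instance_type]
    return cpu >= job_cpu and ram_mb >= job_ram_mb

if __name__ == "__main__":
    need = required_memory_mb(3 * 1024**3)  # a 3 GiB input
    print(need, override_is_sufficient("small", 1, need),
          override_is_sufficient("medium", 1, need))
```

With a 3 GiB input the requirement comes to 6144 MB, so the hypothetical `small` type would be an insufficient override while `medium` would be adequate.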