
N3phele-Torque Design Concerns

Enabling N3phele to submit tasks to Torque will require a number of additions to its current implementation.

First, it will require a new factory (e.g. "pollingFactory") that queues tasks from the N3phele service instead of pushing them to VMs. Second, it will require a new agent that runs on the Torque head node and requests tasks from the pollingFactory. These changes reverse the direction of communication between the factory and the agent. Currently, the factory invokes endpoints on the agent to transfer data and execute commands. Under the new design, the factory will simply queue these requests, and the agent will pull each request and then carry it out (transfer data or submit a command to the Torque queue).
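The reversed direction of communication can be illustrated with a minimal in-memory sketch. It is not N3phele code: the `PollingFactory` class, its `submit` and `poll` methods, the `Request` fields, and the example URL and script name are all hypothetical, and a real agent would poll over the network rather than call the factory directly.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    kind: str     # hypothetical labels: "transfer" or "execute"
    payload: str  # e.g. a file URL or a command line


class PollingFactory:
    """Queues requests instead of pushing them to an agent."""

    def __init__(self) -> None:
        self._pending = deque()

    def submit(self, request: Request) -> None:
        # Called by the service side; nothing is sent to the agent here.
        self._pending.append(request)

    def poll(self) -> Optional[Request]:
        # Called by the agent; returns None when no work is queued.
        return self._pending.popleft() if self._pending else None


factory = PollingFactory()
factory.submit(Request("transfer", "http://example.org/input.dat"))
factory.submit(Request("execute", "run_job.sh"))

# The agent drives the exchange: it pulls until the queue is empty.
handled = []
while True:
    request = factory.poll()
    if request is None:
        break
    handled.append(request.kind)
```

The point of the sketch is that the factory never initiates contact; all calls originate from the agent loop.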

The tasks consist of instructions to transfer files between different sources and destinations (e.g. HTTP -> cluster storage, cluster storage -> S3, etc.) or commands that need to be queued by Torque and executed on a cluster worker node. Presumably these tasks will have a partial ordering: before a command executes, certain input data must be transferred to the cluster storage, and after it completes, the output may need to be transferred elsewhere. These steps must be performed in a particular order: transfer input, execute command, transfer output. Within each step, however, the order does not matter. Input files can be transferred in any order as long as they are all transferred before the command begins execution, and the same goes for output files after the command completes.

The design is complicated further if a single pollingFactory serves multiple Torque agents. In that case, each Torque agent must request tasks that belong to a single group. For example, one Torque agent shouldn't transfer input data to its cluster storage while another Torque agent executes the command. A single Torque agent should service all related instructions: transferring the input, executing the command(s), and transferring the output.
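One way to picture the partial ordering is a task group that only releases tasks from the earliest unfinished phase. The sketch below is an assumption for illustration: the `TaskGroup` class, the phase names, and the task names are hypothetical, and task names are assumed to be unique within a group.

```python
from dataclasses import dataclass, field

# Phases run strictly in this order; tasks inside a phase are unordered.
PHASES = ("transfer_input", "execute", "transfer_output")


@dataclass
class TaskGroup:
    tasks: dict = field(default_factory=lambda: {p: set() for p in PHASES})
    done: set = field(default_factory=set)

    def add(self, phase: str, name: str) -> None:
        self.tasks[phase].add(name)

    def complete(self, name: str) -> None:
        self.done.add(name)

    def runnable(self) -> set:
        """Return the unfinished tasks of the earliest incomplete phase."""
        for phase in PHASES:
            remaining = self.tasks[phase] - self.done
            if remaining:
                return remaining
        return set()


group = TaskGroup()
group.add("transfer_input", "a.dat")
group.add("transfer_input", "b.dat")
group.add("execute", "cmd")
group.add("transfer_output", "out.dat")

first = group.runnable()        # both inputs, in no particular order
group.complete("b.dat")
group.complete("a.dat")
second = group.runnable()       # the command, now that all inputs are present
group.complete("cmd")
third = group.runnable()        # the output transfer
```

Because `runnable` returns a set, the sketch makes no promise about order inside a phase, which matches the requirement that only the phase boundaries are ordered.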

These issues raise a number of concerns that will impact the design of N3phele-Torque integration:
  • Will multiple Torque agents be allowed for a single N3phele deployment?
    • If so, will there be a single pollingFactory that services multiple Torque agents, or one pollingFactory per Torque agent? A single pollingFactory allows the Torque agents to load balance themselves, since each agent requests new work only when it is ready to handle it. One pollingFactory per Torque agent requires the N3phele service to determine ahead of time (and presumably without load information) which Torque server to send instructions to, which could result in certain Torque clusters becoming overloaded while others sit idle.
    • If so, and if there is only a single pollingFactory for all agents, how will each Torque agent ensure that it receives all related instructions (transfer input, commands to execute, transfer output)?
  • Because the direction of communication between the factory and the agent is reversed, the agent will need to be able to distinguish between requests to transfer data and requests to execute commands. (Currently the factory invokes a different endpoint in the agent depending on whether data is to be transferred or a command is to be executed.)
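The last two concerns can be explored with a small sketch. It shows one possible answer, not a decided design: a shared queue hands out whole task groups so that a single agent receives every related instruction, and each task carries a `kind` field so the agent can tell transfers from commands. The `GroupQueue` class, the `dispatch` function, the task dictionaries, and the handler strings are all hypothetical, and the handlers only build strings rather than moving data or submitting jobs.

```python
from collections import deque


class GroupQueue:
    """Shared queue that hands out one complete task group per poll."""

    def __init__(self):
        self._groups = deque()
        self.assigned = {}  # group id -> agent id that received the group

    def submit_group(self, group_id, tasks):
        self._groups.append((group_id, list(tasks)))

    def poll(self, agent_id):
        # An agent asks for work only when it is ready to handle it.
        if not self._groups:
            return None
        group_id, tasks = self._groups.popleft()
        self.assigned[group_id] = agent_id
        return group_id, tasks


def dispatch(task, handlers):
    """Route a task to the handler selected by its 'kind' field."""
    return handlers[task["kind"]](task)


# Placeholder handlers: they describe the action instead of performing it.
handlers = {
    "transfer": lambda t: "copy %s -> %s" % (t["src"], t["dst"]),
    "execute": lambda t: "submit %s" % t["command"],
}

queue = GroupQueue()
queue.submit_group("job-1", [
    {"kind": "transfer", "src": "http://example.org/in.dat", "dst": "cluster:/in.dat"},
    {"kind": "execute", "command": "run_job.sh"},
    {"kind": "transfer", "src": "cluster:/out.dat", "dst": "s3://bucket/out.dat"},
])
queue.submit_group("job-2", [{"kind": "execute", "command": "other.sh"}])

group_a = queue.poll("agent-A")
group_b = queue.poll("agent-B")
actions = [dispatch(task, handlers) for task in group_a[1]]
```

Handing out whole groups keeps related instructions on one cluster but makes load balancing coarser, since an agent commits to an entire group at once.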


