NetworkComputer: Why is performance important?
On modern hardware (e.g. an Opteron-based v20z system running Solaris 10), the NetworkComputer software, thanks to its event-driven scheduler, can submit, dispatch, execute, and log 1,000 small jobs (e.g. "sleep 0") in less than 5 seconds, which works out to a per-job overhead of a few milliseconds.
Whether this low overhead makes an impact depends entirely on the characteristics of the workload. The impact will be negligible if the workload consists, for example, of just a few thousand jobs, each lasting a few minutes or more.
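The arithmetic behind these two statements can be sketched as follows. This is a minimal illustration, not a measurement: the 1,000 jobs and the 5 second bound come from the text above, while the workload of 3,000 jobs of 3 minutes each and the function name `overhead_fraction` are assumptions chosen only to stand in for "a few thousand jobs each lasting a few minutes".

```python
# Figures quoted in the text: 1,000 small jobs handled in under 5 seconds.
JOBS_MEASURED = 1_000
TOTAL_SECONDS = 5.0  # upper bound from the text

# Upper bound on the scheduler overhead per job, in seconds (5 ms).
PER_JOB_OVERHEAD = TOTAL_SECONDS / JOBS_MEASURED


def overhead_fraction(job_seconds: float, overhead: float = PER_JOB_OVERHEAD) -> float:
    """Share of the time spent on one job (run time plus overhead) that is overhead."""
    return overhead / (job_seconds + overhead)


if __name__ == "__main__":
    # Assumed workload for illustration: 3,000 jobs of 3 minutes each.
    n_jobs, job_seconds = 3_000, 180.0
    total_overhead = n_jobs * PER_JOB_OVERHEAD
    print(f"per-job overhead bound: {PER_JOB_OVERHEAD * 1000:.1f} ms")
    print(f"total overhead for {n_jobs} jobs: {total_overhead:.1f} s")
    print(f"overhead share per job: {overhead_fraction(job_seconds):.4%}")
```

Under these assumptions the total scheduling overhead is a matter of seconds against many hours of compute, which is why the benefit is hard to notice for long jobs. Passing a smaller `job_seconds` shows how the share grows as jobs get shorter.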
On the other hand, the impact will be significant if the workload is large and consists of smaller and smaller jobs. With traditional schedulers, it is not practical to submit thousands of short jobs (say, less than 2 seconds each) because of the penalty imposed by the cycle-based scheduler. Designers are therefore induced to bundle the small jobs into packages of 10 to 20 jobs. This packaging is an extra burden for the user of the farm, is an artificial constraint that leads to suboptimal execution, and is an obstacle to managing the jobs when failures occur in a subset of the jobs in a package. With a low per-job overhead, designers are free to submit their jobs to the farm regardless of size.
With near-zero latency:
- The user can submit jobs to the farm in their most natural granularity, that is, without artificial packaging
- If some jobs fail, it is easier to identify them and, if needed, rerun them
- As machines become faster and applications improve, the workloads tend to become more granular
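
The point about failed jobs can be made concrete with a small sketch. It rests on an assumption that is not stated in the text: that a failure anywhere in a package means the whole package is rerun. The function name `jobs_to_rerun` and the job numbering are hypothetical and are not part of NetworkComputer.

```python
from typing import Iterable


def jobs_to_rerun(failed: Iterable[int], package_size: int, total_jobs: int) -> int:
    """Count jobs rerun when each failed job forces its whole package to rerun.

    Jobs are numbered 0..total_jobs-1 and packed consecutively into packages
    of `package_size` (the last package may be smaller). This rerun policy is
    an assumption made for illustration only.
    """
    if package_size < 1:
        raise ValueError("package_size must be at least 1")
    # Each failed job marks the package that contains it.
    affected_packages = {job // package_size for job in failed}
    rerun = 0
    for pkg in affected_packages:
        start = pkg * package_size
        end = min(start + package_size, total_jobs)
        rerun += end - start
    return rerun


if __name__ == "__main__":
    failed_jobs = [3, 17, 18, 995]  # hypothetical failures
    for size in (1, 10, 20):
        count = jobs_to_rerun(failed_jobs, size, total_jobs=1_000)
        print(f"package size {size}: {count} jobs rerun")
```

With a package size of 1, which corresponds to submitting jobs at their natural granularity, only the failed jobs themselves are rerun. Larger packages drag along jobs that had already succeeded.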
