On Friday, November 21st, the Da Vinci Engineering team noticed delayed processing of tasks from one of our background worker pools. The team raised the scaling limits for that pool and immediately scaled up its resources to clear the backlog and ensure that plenty of workers would be in place for campaign preparation. However, a misconfiguration in the pool's resource allocation policies caused its resources to shrink back to their original levels over time while the worker count stayed high, leaving too few resources to be shared among that many workers. This also caused downstream issues on Tuesday, November 25th (incident here), when similar behavior was observed: some campaigns were stuck in the “syncing” or “scheduling” stages.
The incident was resolved by fixing the resource allocation policies so that each worker has sufficient resources to operate effectively. We expect the worker pool to continue operating smoothly going forward.
The Engineering team is auditing the resource allocation policies for all other internal applications to ensure that they will not run into similar issues when scaling.
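As a rough illustration of the kind of check such an audit might include, the sketch below flags any pool whose resource floor could not support its maximum worker count. It is a minimal example under stated assumptions, not our actual tooling: the `PoolPolicy` fields, the pool names, and all numbers are hypothetical, and memory stands in for whatever resource a given pool shares.

```python
from dataclasses import dataclass


@dataclass
class PoolPolicy:
    # All fields are hypothetical and exist only for this illustration.
    name: str
    max_workers: int           # upper scaling limit for the pool
    min_memory_mb: int         # floor that pool resources may shrink back to
    memory_per_worker_mb: int  # what one worker needs to operate effectively


def underprovisioned(policies):
    """Return names of pools whose resource floor cannot support max_workers."""
    return [
        p.name
        for p in policies
        if p.min_memory_mb < p.max_workers * p.memory_per_worker_mb
    ]


# Hypothetical policies: the first pool had its worker limit raised without
# raising its resource floor, which mirrors the failure mode described above.
policies = [
    PoolPolicy("campaign-prep", max_workers=40, min_memory_mb=8192, memory_per_worker_mb=512),
    PoolPolicy("notifications", max_workers=10, min_memory_mb=8192, memory_per_worker_mb=512),
]

flagged = underprovisioned(policies)
print(flagged)
```

A check like this compares the worst case (maximum workers against minimum resources) rather than current values, so it would catch a pool that looks healthy right after a manual scale-up but degrades once resources shrink back.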