Sheriffing/How To/Data Ingestion Backlog

If Treeherder shows pushes but their tasks only appear after a long delay, one of the following is the cause:

  • the tasks are generated late in the Taskcluster instance
  • the Pulse queue carrying the notification about the task’s generation is not updated immediately
  • Treeherder fails to read the notification from the Pulse queue, fails to process it, or is slowed down, e.g. by too many or too expensive database operations.

Identifying the part of the data pipeline in which the issue starts

  • Taskcluster or not: Shortly after a task has shown up late (e.g. a gecko decision task that appears much later than its push), select it in Treeherder and click the link to its Taskcluster page at the bottom left.
    • Compare the task start time shown there with the time at which Treeherder started showing the task as running:
      • If there is a difference of more than 2 minutes, it’s not a Taskcluster issue.
      • If the task is also shown as delayed in Taskcluster, it is an issue in the Taskcluster instance, e.g. cloud machines for the worker instances that execute the tasks are unavailable.
        • This is unlikely to be a bug in the Taskcluster code itself. More likely it is caused by circumstances or by a misconfiguration of the Taskcluster instance or of ci-configuration, both of which are managed by Release Engineering.
        • Escalation channel: #taskcluster-cloudops in Slack
  • Treeherder or Pulse: Execute the query “Latest task submission, start or end time by tree”. Times are in UTC.
    • If the times do not change between runs of the query, this might be an issue with the notification about the task sent from Taskcluster to Treeherder through Pulse.
      • Ask Ops / SRE in #treeherder-ops in Slack to check if there are many unacknowledged messages in the queue for Treeherder.
        • If Yes: It’s a Treeherder issue. Let Ops / SRE in #treeherder-ops in Slack check the health of Treeherder.
          • Most commonly, the database is under heavy load. If the query shows the times by tree progressing, the issue should resolve itself eventually.
        • If No: It’s a Taskcluster or Pulse issue. Ask Ops / SRE in #taskcluster-cloudops in Slack to check whether Taskcluster emits notifications about the task and, if so, where they get lost.
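
The checklist above can be read as a small decision tree. The following Python sketch is only an illustration of that reading, not an existing tool: the function `triage`, the `Verdict` type and all parameter names are hypothetical, and every input is something a sheriff observes by hand (the two timestamps, whether the query times advance, and what Ops / SRE report about unacknowledged messages). Two branches are assumptions that the checklist does not spell out: that no delay anywhere means nothing to escalate, and that advancing query times mean Treeherder is catching up.

```python
from datetime import datetime, timedelta, timezone
from typing import NamedTuple, Optional

# Threshold from the checklist: more than 2 minutes between the Taskcluster
# start time and Treeherder showing the task as running.
DISPLAY_LAG_THRESHOLD = timedelta(minutes=2)

TASKCLUSTER_CHANNEL = "#taskcluster-cloudops"
TREEHERDER_CHANNEL = "#treeherder-ops"


class Verdict(NamedTuple):
    """Hypothetical result type: the suspected component and where to escalate."""
    suspect: str
    channel: Optional[str]


def triage(
    taskcluster_start: datetime,
    treeherder_shown_running: datetime,
    delayed_in_taskcluster: bool,
    query_times_advancing: bool,
    unacked_backlog: Optional[bool] = None,
) -> Verdict:
    """Walk the checklist with manually observed inputs."""
    lag = treeherder_shown_running - taskcluster_start

    if lag <= DISPLAY_LAG_THRESHOLD:
        if delayed_in_taskcluster:
            # The task itself started late: issue in the Taskcluster instance.
            return Verdict("Taskcluster instance", TASKCLUSTER_CHANNEL)
        # Assumption: no delay on either side means nothing to escalate.
        return Verdict("no delay observed", None)

    # Treeherder lags behind Taskcluster, so task generation is ruled out.
    if query_times_advancing:
        # Assumption: progress in the query means the backlog is draining.
        return Verdict("Treeherder is catching up", None)
    if unacked_backlog is None:
        # The queue state is not known yet: ask Ops / SRE first.
        return Verdict("unknown, ask about unacknowledged messages",
                       TREEHERDER_CHANNEL)
    if unacked_backlog:
        return Verdict("Treeherder", TREEHERDER_CHANNEL)
    return Verdict("Taskcluster or Pulse", TASKCLUSTER_CHANNEL)


if __name__ == "__main__":
    # Made-up example times in UTC, matching the query's time zone.
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    shown = start + timedelta(minutes=10)
    print(triage(start, shown, False, False, unacked_backlog=True))
```

Leaving `unacked_backlog` at `None` models the point in the checklist where the answer from Ops / SRE is still pending.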