Root Cause: Billing Migration Failure

Introduction

Four Azure Subscriptions needed to be moved to another Tenant for billing purposes. In this case the transfer had to be done by SURF. anDREa and SURF first tested the move with a test subscription, which was moved correctly without any of the test users experiencing problems.
A billing move is a 'paper' action that does not affect any resources.
SURF started the move on 2022-01-10 @ 12:50 CET.

Issue

From 2022-01-10 @ 13:38 CET onward, more and more users reported that they could no longer see their Workspace(s) residing in the Azure Subscriptions that were marked for the billing move.

Impact of the issue

  1. No data or VMs were lost
  2. Workspaces were not accessible for users
    1. 250+ Studies, 250+ VMs, and 800+ users were affected, about 25% of the total studies/users. Because of holidays, the actual number of users directly affected was lower.
    2. Approximately 6 working hours for EMC and UMCU
    3. Approximately 10 working hours for AUMC Workspaces, because AUMC preferred to move the subscription on 2022-01-11 in the morning

Root cause

A SURF admin ticked the checkbox that also transfers ownership of the Azure Subscriptions. As a result, those Azure Subscriptions and all resources in them were no longer managed by anDREa and were therefore not accessible to users.

Resolution

After understanding the issue and testing the solution:
  1. Request the owners of the moved Azure Subscriptions to move them back to anDREa AAD
  2. Develop and test scripts to restore the RBAC, Policies, and generic settings
  3. Run the scripts, test & verify
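
The RBAC part of step 2 can be sketched as follows. Moving a subscription to a different Azure AD tenant removes its role assignments, so they have to be re-created from a known-good record. This is a minimal sketch, not the scripts anDREa actually used: the `RoleAssignment` type, the snapshot data, and all IDs are hypothetical placeholders. It only builds `az role assignment create` command lines and does not execute them, so the plan can be reviewed first.

```python
from typing import NamedTuple


class RoleAssignment(NamedTuple):
    principal_id: str  # object ID of the user, group, or service principal
    role: str          # role name, e.g. "Reader"
    scope: str         # resource ID the role applies to


def missing_assignments(expected, current):
    """Return expected assignments absent from the current state, in a stable order."""
    return sorted(set(expected) - set(current))


def restore_commands(missing):
    """Build one `az role assignment create` argument list per missing assignment."""
    return [
        ["az", "role", "assignment", "create",
         "--assignee", a.principal_id, "--role", a.role, "--scope", a.scope]
        for a in missing
    ]


# Hypothetical snapshot of the desired state (placeholder IDs, not real data).
expected = [
    RoleAssignment("user-1", "Reader", "/subscriptions/sub-a/resourceGroups/ws-001"),
    RoleAssignment("user-2", "Contributor", "/subscriptions/sub-a/resourceGroups/ws-002"),
]
# Hypothetical state after a partial restore run: one assignment already exists again.
current = [expected[0]]

plan = restore_commands(missing_assignments(expected, current))
for argv in plan:
    print(" ".join(argv))
```

Computing the difference rather than blindly re-creating everything keeps the script safe to re-run, which fits step 3: run on a limited set of Workspaces, verify, refine, and run again.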

Lessons learned

  1. First response time could have been quicker
  2. Tenants must be asked for:
    1. direct contact details of Microsoft Azure Subscription Owners / Global Admins
    2. time window when those contacts can be reached
  3. Collaborating with SURF to identify the root cause went well and significantly reduced the resolution time
  4. Offboarding and Exit Strategy works (moving ownership of a subscription is fast, easy, and requires no cooperation from anDREa bv)
  5. Incident Management Procedure and Disaster Recovery Plan were tested in practice

Timeline

Time stamp
Action
Effect
2022-01-10 @ 12:50 CET
SURF started the transfer of the four Azure Subscriptions marked for transfer. Unlike in the test, a checkbox was checked that would also transfer ownership of those Microsoft Azure Subscriptions.
anDREa AAD was no longer attached, so all the Workspaces (200+ studies, 250+ VMs) were no longer available for 300+ users.
2022-01-10 @ 13:38 CET
Users start reporting that they can't see their Workspace
Assumption that it might resolve itself within an hour or so.
2022-01-10 @ 16:00 CET
Escalation protocol
100% attention on understanding and resolving the problem
2022-01-10 @ 17:00 CET
anDREa started the investigation together with SURF
Potential cause identified: subscription ownership was transferred. 
Concluding that no data or VMs were lost/damaged
Starting to construct RBAC and Policy rebuilding scripts

Contact EMC Azure Admin (had the Subscription with the smallest number of Workspaces) to verify and 
Subscription moved back to anDREa AAD
Testing the RBAC and Policy rebuilding scripts
2022-01-10 @ 17:05 CET
Informing users through Announcements and Telegram channel

2022-01-10 @ 17:58 CET

2022-01-10 @ 18:05 CET
EMC subscriptions restored to EMC MG, deploying RBAC and Policy Rebuilding scripts
Testing the RBAC and Policy rebuilding scripts on a limited number of Workspaces and refining RBAC and Policy rebuilding scripts
2022-01-10 @ 17:40 CET
AUMC confirms that subscription will be moved on 2022-01-11 early in the morning
No restoration can be started until the subscription is moved
2022-01-10 @ 19:43 CET
UMCU subscription moved
Restoration can start as soon as the EMC Workspaces are done
2022-01-10 @ 20:24 CET
UMCU is moved, RBAC and Policy rebuilding scripts are being deployed
Restoration of UMCU subscription started
2022-01-10 @ 22:15 CET

1 EMC subscription restored, waiting for policies to take effect, license server not yet attached
2022-01-10 @ 23:08 CET

UMCU subscription restored, waiting for policies to take effect, license server not yet attached
2022-01-10 @ 23:42 CET

2nd EMC subscription restored, waiting for policies to take effect, license server not yet attached
2022-01-11 @ 08:23 CET
Adding license servers
For both EMC and UMCU subscriptions, adding and resizing VMs works, and the license server works
2022-01-11 @ 08:45 CET
Users start reporting that Workspaces are working fine again

2022-01-11 @ 10:00 CET
AUMC Azure Subscription ownership is being transferred
Restoration of AUMC subscription started
2022-01-11 @ 13:41 CET
AUMC RBAC and Policy rebuilding scripts are being deployed
AUMC can test from 13:48 CET
2022-01-11 @ 14:40 CET
AUMC confirms verification is successful.
De-escalating the problem, planning clean-up work
Issue closed
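
The intervals behind the first lesson learned can be checked directly from the time stamps above. The sketch below computes wall-clock durations only. The variable names are illustrative, and all stamps are taken as CET, so no time-zone conversion is applied. The "working hours" figures in the impact section are smaller than these values, presumably because they count only working time.

```python
from datetime import datetime

FMT = "%Y-%m-%d @ %H:%M"  # timeline format, without the trailing "CET"


def elapsed(start, end):
    """Return (hours, minutes) between two timeline stamps in the same time zone."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    total_minutes = int(delta.total_seconds() // 60)
    return divmod(total_minutes, 60)


transfer_started = "2022-01-10 @ 12:50"
first_reports = "2022-01-10 @ 13:38"
escalation = "2022-01-10 @ 16:00"
emc_umcu_working = "2022-01-11 @ 08:45"
aumc_verified = "2022-01-11 @ 14:40"

print(elapsed(transfer_started, first_reports))     # (0, 48)
print(elapsed(first_reports, escalation))           # (2, 22)
print(elapsed(transfer_started, emc_umcu_working))  # (19, 55)
print(elapsed(transfer_started, aumc_verified))     # (25, 50)
```

The 2 hours and 22 minutes between the first user reports and the escalation is the interval that lesson 1 refers to.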