Uptime & resilience

We have live public dashboards showing uptime for Plan✕:

Production https://status.planx.uk/
Staging https://status.planx.dev/

| Guaranteed uptime | The service is currently hosted on AWS and therefore benefits from their underlying availability (99.9% uptime) and resilience.

If the PlanX service itself goes down, our team will be alerted automatically and will seek to restore the service as quickly as possible.

No refunds are currently agreed in the event of service downtime as part of the SLA. We strongly encourage shorter contract periods and encourage customers not to renew if they are dissatisfied with the service availability. | | --- | --- | | Approach to resilience | The weakest links in Plan✕ are our integrations with customer or third-party hosts. Our integrations fall into one of two categories:

Plan✕ pulls data from another source (e.g. Ordnance Survey or Planning Data). If these sources are unavailable, Plan✕ is designed to continue to function seamlessly, but any live applicants may have to answer more questions on their form than normal.
Plan✕ pushes or submits data to other services (e.g. GOV.UK Pay, back-office systems). If these services are unavailable, the applicant may be unable to pay for or submit their application until the service comes back up. Plan✕ is configured to automatically retry failed events and will retain user data and progress to ensure that no data is lost. | | Outage reporting | Customers will be notified of any planned outages by email in advance, and such outages will be timed to minimise disruption. In the event of any unplanned outage, the Customer will be informed as quickly as possible by email. | | Minimising processor / memory storage utilisation | Plan✕ is deployed on cloud-based infrastructure (AWS). We use caching to reduce the impact of usage spikes, and we monitor containers to ensure that they are restarted if they error or leak memory. | | Response time for transactions | Network traffic response times will be monitored, and reports can be shared with the Customer on request. | | Scaling | We have carefully chosen to work with technologies that can handle thousands of requests per second on extremely modest hardware.

We enable dynamic scaling / elastic load balancing to ensure that our infrastructure is responsive to load and always optimal in terms of hardware specification and the number of instances that deliver the service.

We use Cloudflare for load monitoring. | | Load testing | We maintain a dedicated set of scripts (written in Python using Locust), which we can use to load test our AWS stack.

We run these against staging from time to time, as well as whenever there is a significant change in the nature or volume of PlanX usage. We then examine the results and adjust how we provision our services to accommodate that usage.

We also simulate workloads using these tools whenever we make significant infrastructure changes or ahead of service expansion events, including public launches for new local authorities.

| | Monitoring load | We also carefully monitor metrics such as active users and response times to ensure that they do not negatively impact one another. | | Disaster recovery | In the event of a main system failure, our steps would be:

Identify point of failure.
Agree response strategy to get service back online.
Post a status page update and, if appropriate, a phase banner notification on the service to keep users up to date.
Test and redeploy.

In all but the most exceptional circumstances, our Recovery Time Objective (RTO) is within 24 hours.

We have tried to mitigate the effects of a disaster by deploying our services in separate containers on cloud infrastructure. Complete data backups are taken several times per day, which sets our Recovery Point Objective (RPO). (See Backups) | | Fault reporting & response | In many cases, our product team will be automatically notified of any issues (see monitoring).

When an issue is reported to us, we will immediately decide:

Whether it can be fixed or resolved immediately, in which case we will do so.
Whether it is sufficiently serious that users or admins need to be notified. We also consider any knock-on effects and anyone we are obliged to inform.
If not, its priority level. In the case of high-priority issues we will keep in contact with the reporter until the issue is fixed.

If an admin is not satisfied that the team is responding in a reasonable way, they should escalate it to the CEO. |
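
The automatic retry of failed events described under "Approach to resilience" can be illustrated with a short sketch. This is not the Plan✕ implementation: the function name `submit_with_retry`, the exception `DestinationUnavailable`, the attempt limit and the doubling delay are all assumptions chosen for illustration. It shows only the general idea of retrying a push to an unavailable service while the applicant's saved data stays untouched.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class DestinationUnavailable(Exception):
    """Hypothetical error raised when a downstream service cannot be reached."""


def submit_with_retry(
    send: Callable[[], T],
    max_attempts: int = 5,
    base_delay_seconds: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call send() until it succeeds or max_attempts is reached.

    The wait doubles after each failed attempt (1s, 2s, 4s, ...).
    Nothing here modifies the data being sent, so a final failure
    leaves the caller free to try again later with the same payload.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except DestinationUnavailable:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay_seconds * 2 ** (attempt - 1))
    raise ValueError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter lets the backoff be exercised in tests without real waiting, for example by passing `delays.append` and inspecting the recorded delays afterwards.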