by Andrew Boag
Catalyst started up in 1997, and from humble beginnings as an outsourced services company we have grown to over 250 staff globally, with seven offices across Australia, New Zealand and the United Kingdom. Delivering open source software solutions to large and small clients is what we do. It’s what we love.
Our approach is for each office to service its own region, connecting our local team members with local projects and engagements. This is one of our strengths and points of difference. We aim to build long-term relationships with local clients that are more than just transactional project relationships.
From time to time we share work between offices. For example, a large project might land in Sydney that needs a bit more project muscle than the local team has. This has allowed us to punch above our weight on big projects when we set up an office in a new geographic region.
Catalyst's European office is in Brighton, UK and has a growing team of developers, business analysts and system administrators. Some of the larger Moodle LMS managed services engagements for our European University clients require 24x7 infrastructure and application monitoring. Our current global cloud platform for enterprise service delivery is Amazon Web Services. Catalyst is an AWS Partner and we know the toolset well, having built and managed a number of large workloads.
Catalyst has been involved in High Availability (HA) application design and architecture for a while. However, even the perfect system still needs a defined escalation framework for when issues occur. We aim to detect and fix issues before our clients even realise.
Historically, Catalyst has used an 'on call' pager roster (even though pagers are almost dead) for our infrastructure team. Responsibilities for out-of-hours service are shared across the team. Of course we pay extra for this, but it’s no one’s preference that our staffers are up all night dealing with alerts and breakages. In at least one case, a noisy pager has been the cause of serious marital stress - wife with newborn baby sending husband with beeping pager to the lounge to sleep!
In the interest of providing the highest level of service and reliability for our clients (and letting our infrastructure team get more sleep), the Australian, New Zealand and European offices decided to set up a Follow The Sun (FTS) support model. The idea is that we share responsibility for systems across time zones. Ideally, the technician responding to and investigating an alert is 'in sunlight', i.e. not waking up at 3 in the morning. This approach is becoming more and more common as technical and development teams are distributed across the globe.
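To make the 'in sunlight' idea concrete, here is a minimal sketch of how an alert could be routed to whichever office is closest to its working day. It is an illustration only, not a description of Catalyst's actual tooling: the office table, the fixed UTC offsets (which ignore daylight saving), the 08:00 to 18:00 working day and the function names are all assumptions made for the example.

```python
from datetime import datetime, time, timedelta, timezone

# Hypothetical office table. Fixed UTC offsets are an assumption that
# ignores daylight saving; a real roster would use proper time zone data.
OFFICES = [
    ("New Zealand", timezone(timedelta(hours=12))),
    ("Australia", timezone(timedelta(hours=10))),
    ("Europe", timezone(timedelta(hours=0))),
]

# Assumed working day, in each office's local time.
DAY_START = time(8, 0)
DAY_END = time(18, 0)


def minutes_outside_working_day(local):
    """Return 0 inside the working day, else minutes to its nearest edge."""
    now = local.hour * 60 + local.minute
    start = DAY_START.hour * 60 + DAY_START.minute
    end = DAY_END.hour * 60 + DAY_END.minute
    if start <= now < end:
        return 0
    until_start = (start - now) % 1440  # minutes until the day begins
    since_end = (now - end) % 1440      # minutes since the day ended
    return min(until_start, since_end)


def first_responder(now_utc):
    """Pick the office that is in, or nearest to, its working day.

    Ties go to the office listed first in OFFICES.
    """
    def distance(office):
        _name, tz = office
        return minutes_outside_working_day(now_utc.astimezone(tz).time())

    return min(OFFICES, key=distance)[0]


if __name__ == "__main__":
    alert_time = datetime(2016, 1, 15, 12, 0, tzinfo=timezone.utc)
    print(first_responder(alert_time))
```

With these assumed hours there is a short window each day when no office is inside its working day, so the sketch falls back to the office nearest to it rather than leaving the alert unassigned.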
Our FTS programme has now been up and running for over 18 months. We started discussing it in 2015, and the first round of cross-team alerts went out in Jan 2016. It has been quite a journey.
Here are some of the things we've learned along the way.
Mandated inter-team communications
This means regular phone conferences and video catch-ups, each with an agenda. These meetings will not happen by themselves. Maintaining regularity between Australia and the UK is a challenge when there are no overlapping work hours. It’s either early in the morning or late at night for one side. Things have to be planned and agreed well in advance.
It’s always better to talk than to not talk. Even if there’s nothing to talk about, we discuss what's happened recently, any event notifications or changes on either side.
Walking together technically
Catalyst is all about the application of free and open source technologies to deliver value for our clients. This means that we embrace the use of new toolsets and technologies. Innovation is in our DNA.
However, when we are responsible for fixing a complicated web application hosted in AWS that the responding technician may not have built, it’s critical that all team members have a broad understanding of how things fit together. It’s even better if the solution’s architecture team has committed to building systems in a standard-ish fashion.
Given the fast-moving pace of cloud hosting services and the broad requirements of our different global clients, all the regional teams need to be free to do what they need to do for better customer outcomes. This needs to be balanced with some level of standardisation in terms of build and deployment policy. This is a challenging problem, and it is not new.
We have learnt that too much control around change or toolsets is counterproductive. But huge deviation from standard operations is also not ideal.
There is no magic wand here. Most important is people talking to people - especially at the senior technical level. Combined with good documentation practices, this builds trust and a collaborative tone. It means that innovations by one team are more likely to get adopted by all, not ignored or vetoed.
The right communication and alert tools
The ability to communicate across the team, from any device, is critical. It should not be hard to reach out. And there should be a clear and concise audit trail of events and actions taken.
We also need confidence that if an alert is missed, the global escalation framework is solid and gets all the way to the CTO if required.
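As a rough illustration of what an escalation framework like this involves, the sketch below walks an ordered chain of responders and works out who should have been notified once an alert has gone unacknowledged for a given time. The roles below the CTO, the wait times and all names in the code are hypothetical, chosen only for the example; they are not Catalyst's actual chain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    role: str
    wait_minutes: int  # time this level has to acknowledge before escalating


# Hypothetical chain: only "ends at the CTO" comes from the article.
CHAIN = [
    Level("daylight technician", 15),
    Level("local on-call sysadmin", 15),
    Level("infrastructure lead", 30),
    Level("CTO", 0),  # last stop, nothing further to escalate to
]


def paged_so_far(minutes_unacknowledged, chain=CHAIN):
    """Return the roles that should have been notified by now, in order."""
    paged = []
    elapsed = 0  # minutes after the alert at which the next level is paged
    for level in chain:
        if minutes_unacknowledged < elapsed:
            break
        paged.append(level.role)
        elapsed += level.wait_minutes
    return paged


if __name__ == "__main__":
    for minutes in (0, 20, 60):
        print(minutes, paged_so_far(minutes))
```

Keeping the chain as plain data makes it easy to review and to record alongside the audit trail of who was notified and when.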
No hiding from mistakes
Management responsibility for enterprise applications is nothing new to Catalyst, and we have the scars and stories to prove it.
In the real world, systems break and people make mistakes … bad stuff happens. The bigger mistake is to sweep these events under the carpet or descend into the blame game. The focus needs to be on analysing and improving the underlying system to make sure problems don’t recur.
Don’t get frustrated. Get better.
There were at least six months of planning and discussions before the first cross-team alert went out. So after all this effort, what are the real benefits to our team and clients?
- Catalyst can provide better system support for our clients. More daytime attention to systems when they are in need.
- Less sleep interruption for our valued sysadmins! Before we wake someone up, another capable team member in the sunshine on the other side of the planet reviews and (ideally) resolves the issue. In the past, too much night-time alert activity has caused some of our team members to find another job.
- Ability to perform out-of-hours updates and upgrades for our clients. It’s now very simple for us to roll out changes at 3am local time with a day or two of planning.
- More flexibility in team size for project and build work. We are more able to lean on each other across regions because we are working with each other more.
- All-round better communication between the Catalyst offices. A good thing, and something you can’t take for granted when everyone is busy on projects and dealing with endless business activity.
We consider this initiative a great success. It allows all parts of the Catalyst group to provide better services to our clients.
Special thanks to Alex Lawn from the Sydney team who is driving this initiative.