Why do things break? Simply because of change. Can we avoid change? No. Changes can come from inside or outside the system.
Do we know what will be impacted by even a tiny change? A good system can track about 90% of what is related. Some minor changes might never be recorded at all.
We can't avoid change, and we know we might miss something, so let's minimize the risk. How can we do that? We need a system that monitors our system. If something goes wrong, the monitoring system triggers an alert and we can take action as soon as possible.
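To make the idea concrete, here is a minimal sketch of what "a system that monitors our system" does at its core: it probes each service and raises an alert when a probe fails. The names `check_tcp`, `alert`, and `run_checks` are made up for illustration, and the alert is just a `print`; a real monitor would send an email or a page instead.

```python
import socket

def check_tcp(host, port, timeout=3.0):
    """Try to open a TCP connection; return (ok, message)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, f"OK: {host}:{port} is reachable"
    except OSError as exc:
        # Covers refused connections, timeouts, and DNS failures.
        return False, f"CRITICAL: {host}:{port} is unreachable ({exc})"

def alert(message):
    # Placeholder (assumption): a real monitor would notify a human here.
    print(f"ALERT: {message}")

def run_checks(targets):
    """Probe every (host, port) pair and return the ones that failed."""
    failures = []
    for host, port in targets:
        ok, message = check_tcp(host, port)
        if not ok:
            alert(message)
            failures.append((host, port))
    return failures
```

Calling `run_checks` with a list of `(host, port)` pairs on a schedule is the whole loop. A blocked firewall port or a service that did not come back after a move shows up as a failed probe.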
I have a system that has been running for years with very low maintenance. One day, we had to replace the server. I thought it would be easy: I knew what the main components were and what to migrate. What I hadn't tracked were some minor things, such as firewall settings and replication.
When I did the migration, everything seemed to work properly until I shut down the old server. Then I got a bunch of alerts. They told me that my system was not in good health, so I could take action right away.
Luckily, I had put every service that I run into a monitoring system (Nagios), so I could go there and check what was not working properly. I have a bad habit of not documenting the system, but at least I have a system that monitors our applications.
You don't need a fancy monitoring system; an open source one like Nagios will do. Try to avoid relying on standalone scripts for monitoring, since they are only good for simple tasks.
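Nagios itself runs small check programs (plugins) and handles the scheduling, alerting, and history around them, which is the part a standalone script lacks. A plugin prints one status line and reports its result through its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN. Here is a rough sketch of a disk usage check in that style. The function names and the 80%/90% thresholds are my own assumptions, not something from my actual setup.

```python
import shutil

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to (exit_code, status_line).

    The default thresholds are illustrative assumptions.
    """
    if used_percent >= crit:
        return CRITICAL, f"DISK CRITICAL - {used_percent:.1f}% used"
    if used_percent >= warn:
        return WARNING, f"DISK WARNING - {used_percent:.1f}% used"
    return OK, f"DISK OK - {used_percent:.1f}% used"

def check_disk(path="/"):
    """Measure disk usage for path and classify it."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        # The check itself could not run, so the state is unknown.
        return UNKNOWN, f"DISK UNKNOWN - {exc}"
    if usage.total == 0:
        return UNKNOWN, "DISK UNKNOWN - filesystem reports zero size"
    return classify(100.0 * usage.used / usage.total)

def main():
    code, message = check_disk()
    print(message)
    return code
```

To use it as a plugin, end the script with `sys.exit(main())` (after `import sys`) so the monitoring system can read the exit code.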