> Every company out there is using the cloud and yet still employs infrastructure engineers to deal with its complexity. The "cloud" reducing staff costs is and was always a lie.
This doesn’t make sense as an argument. The cloud is more complex because that complexity is available. Below a certain company size, a large number of cloud products simply can’t be managed in-house (and certainly not all of them together).
Also, your argument is incorrect in my experience.
At a smaller business I worked at, I was able to use these services to achieve uptime and performance that I couldn’t achieve self-hosted, because I had to spend time on the product itself. So yeah, we’d saved on infrastructure engineers.
At larger scales, what your false dichotomy suggests also doesn’t actually happen. Where I work now, our data stores are all self-managed on top of EC2/Azure, where performance and reliability are critical. But we don’t self-host everything. For example, we use SES to send our emails and we use RDS for our app DB, because their performance profiles and uptime guarantees are more than acceptable for the price we pay. That frees up our platform engineers to spend their energy on keeping our uptime on our critical services.
>At a smaller business I worked at, I was able to use these services to achieve uptime and performance that I couldn’t achieve self-hosted, because I had to spend time on the product itself. So yeah, we’d saved on infrastructure engineers.
How sure are you about that one? All of my Hetzner VMs reach an uptime of 99.9-something percent.
I could see more than one small business’s stack fitting onto a single one of those VMs.
100% certain, because I started out self-hosting before moving specific components to AWS services, which improved uptime and reduced the time I spent keeping those services alive.
What was the work you spent configuring those services and keeping them alive? I’m genuinely curious...
We have a very limited set of services, but most have been very painless to maintain.
A Django+Celery app behind Nginx back in the day. Most maintenance would be discovering a new failure mode:
- certificates not being renewed in time
- Celery eating up all RAM and having to be recycled
- RabbitMQ getting blocked requiring a forced restart
- random issues with Postgres that usually required a hard restart of PG (running low on RAM maybe?)
- configs having issues
- running out of inodes
- DNS not updating when upgrading to a new server (no CDN at the time)
- data centre going down, taking the provider’s email support with it (yes, really)
Bear in mind I’m going back a decade now, so my memory is rusty. Each issue was solvable, but each would happen at random, and even mitigating them was time that I (a single dev) was not spending on new features or fixing bugs.
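For what it's worth, the Celery RAM issue is the kind of thing that's commonly tamed these days by recycling worker processes. A rough sketch of the relevant settings, assuming a plain Celery app (the "proj" name and broker URL are placeholders, not my old config):

```python
# Rough sketch: recycle Celery workers before they eat all the RAM.
# "proj" and the broker URL are placeholders, not the original setup.
from celery import Celery

app = Celery("proj", broker="amqp://guest@localhost//")

app.conf.update(
    # Replace a worker process after it has run this many tasks,
    # which bounds slow memory leaks in task code.
    worker_max_tasks_per_child=100,
    # Replace a worker process once its resident memory exceeds ~200 MB
    # (value is in kilobytes).
    worker_max_memory_per_child=200_000,
)
```

Something similar exists for most of the other items (certbot timers for renewals, monitoring for inodes and RAM), but each one is still a thing you have to know about, set up, and watch.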
I mean, going back a decade might be part of the reason?
Config issues are like the number one reason I like this setup so much.
I can configure everything on my local machine, test it there, and then just deploy it to a server the same way.
I don't have to build a local setup and then a separate remote one.
Er… what? Even in today’s world with Docker, you have differences between dev and prod. For a start, one is accessed via the internet and requires TLS configs to work correctly. The other is accessed via localhost.
I use HTTPS for localhost; there are a ton of options for that.
But yes, the cert is created differently in prod and there are a few other differences.
But it's much closer than in the cloud.
Just FYI, you can put whatever you want in /etc/hosts; it gets hit before the resolver. So you can run your website on localhost with your regular hostname over HTTPS.
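For example (the hostname here is just a placeholder for whatever your real domain is):

```
# /etc/hosts - hypothetical entry pointing your production hostname at localhost
127.0.0.1   myapp.example.com
```

Then your local dev server answers on the same name you use in prod, and a locally trusted HTTPS cert for that name (e.g. via mkcert) completes the picture.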
I’m aware, I just picked one example, but there are others, like using the console instead of a mail server, or having a CDN.
Just because your VM is running doesn't mean the service is accessible. Whenever there's a large AWS outage it's usually not because the servers turned off. It also doesn't guarantee that your backups are working properly.
If you have a server where everything runs on that server, the server being up means everything is online... There isn't a lot of complexity going on inside a single-server infrastructure.
I mean just because you have backups does not mean you can restore them ;-)
We do test backup restoration automatically, and also manually on a quarterly basis, but you should do the same with AWS.
Otherwise, how do you know you can restore system A without impacting its other dependencies, D and C?
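To give an idea of what the automated part can look like, here's a rough sketch in Python; the backup path, database name, and the table used for the sanity check are placeholders, not our actual setup:

```python
#!/usr/bin/env python3
"""Rough sketch of an automated restore test for a nightly pg_dump
custom-format backup. Paths, DB name, and the sanity query are
placeholders, not a real production config."""
import subprocess

BACKUP = "/backups/app-latest.dump"   # hypothetical backup location
SCRATCH_DB = "restore_test"           # throwaway database for the test

# Recreate a scratch database and restore the latest dump into it.
subprocess.run(["dropdb", "--if-exists", SCRATCH_DB], check=True)
subprocess.run(["createdb", SCRATCH_DB], check=True)
subprocess.run(["pg_restore", "--no-owner", "-d", SCRATCH_DB, BACKUP], check=True)

# Basic sanity check: the restored data should actually contain rows.
result = subprocess.run(
    ["psql", "-tA", "-d", SCRATCH_DB, "-c", "SELECT count(*) FROM auth_user;"],
    check=True, capture_output=True, text=True,
)
assert int(result.stdout.strip()) > 0, "restore produced an empty users table"
print("restore test passed")
```

The manual quarterly run then only has to confirm the application actually works against a restored copy, not that the dump itself is usable.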
Yes, mix-and-match is the way to go, depending on what kind of skills are available in your team. I wouldn't touch a mail server with a 10-foot pole, but I'll happily self-manage certain daemons that I'm comfortable with.
Just be careful not to accept more complexity just because it is available, which is what the AWS evangelists often try to sell. After all, we should always make an informed decision when adding a new dependency, whether in code or in infrastructure.
Of course AWS are trying to sell you everything. It’s still on you and your team to understand your product and infrastructure and decide what makes sense for you.