> So you're stuck debugging a system you don't control, through screenshots and copy-pasted logs on a Zoom call.

This is very real.

I work with a deployment that operates exactly this way. Unfortunately, we can't maintain _any_ connection back to our servers. Pull or push, doesn't matter.

The goal right now is to build out tooling to export logs and telemetry data from an environment, such that a customer could trigger that export at our request, or (ideally) as part of the support ticketing process. Then our engineers can analyze it asynchronously. This can be a ton of data though, so we're trying to figure out what to compress and how. We also have the challenge of figuring out how to scrub logs of any potentially sensitive information. Even IDs, file names, and the like that only matter to the customer.
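
To make that concrete, here's a rough sketch of what the scrub-then-compress step could look like (Python; the redaction patterns are hypothetical placeholders, not real rules):

```python
import gzip
import re

# Hypothetical redaction rules -- real ones would come from the customer's
# definition of "sensitive", not a hardcoded list like this.
REDACTIONS = [
    (re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"), "<email>"),    # email addresses
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),   # IPv4 addresses
    (re.compile(r"/home/[^/\s]+"), "/home/<user>"),         # home-dir paths
]

def scrub(line: str) -> str:
    for pattern, replacement in REDACTIONS:
        line = pattern.sub(replacement, line)
    return line

def export(in_path: str, out_path: str) -> None:
    # Scrub line by line, then gzip the result for the export bundle.
    with open(in_path, encoding="utf-8", errors="replace") as src, \
         gzip.open(out_path, "wt", encoding="utf-8") as dst:
        for line in src:
            dst.write(scrub(line))
```

The catch, of course, is that regex scrubbing over free-form text can never prove it caught everything.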

> Unfortunately, we can't maintain _any_ connection back to our servers. Pull or push, doesn't matter.

We're working on something for this! Stay tuned.

I also used to work with on-premises installs of Kubernetes, and their “security” postures prevented any inbound access. It was a painful process of requesting access, getting on a Zoom call, and then controlling their screen via a Windows client and PuTTY. It was beyond painful and frustrating. I tried to pitch a tool like Twingate, which doesn’t open any inbound ports and can be locked down very tight using SSO, 2FA, access-control rules, and IP limiting, but to no avail. They were stuck in their Windows-based IT mentality.

At least they didn't ask you to TeamViewer into a Windows Server 2012 box and open Event Viewer...

That would be my preference compared to the situation you're replying to. Event Viewer is powerful if one takes some time to learn it.

Fair point

> This can be a ton of data though, so we're trying to figure out what to compress and how. We also have the challenge of figuring out how to scrub logs of any potentially sensitive information.

This is fundamentally a data modeling problem. Currently, computer telemetry data are just little bags of UTF-8 bytes, or at best something like `list<map<bytes, bytes>>`. IMO this needs to change from the ground up. Logging libraries should emit structured data, conforming to a user-supplied schema. Not some open-ended schema that tries to be everything to everyone. Then it's easy to solve both problems--each field is a typed column which can be compressed optimally, and marking a field as "safe" is something encoded in its type. So upon export, only the safe fields make it off the box, or out of the VPC, or whatever--note you can have a richer ACL structure than just "safe yes/no".
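
A minimal sketch of the idea, using Python dataclasses as a stand-in for a real schema language (the field names and the safe/unsafe split are invented for illustration):

```python
from dataclasses import dataclass, field, fields

# User-supplied schema: every field is typed, and export visibility is part
# of the field's definition rather than a scrubbing pass bolted on later.
@dataclass
class RequestLog:
    timestamp_ms: int = field(metadata={"safe": True})
    status_code: int = field(metadata={"safe": True})
    duration_ms: float = field(metadata={"safe": True})
    customer_id: str = field(metadata={"safe": False})  # never leaves the box
    file_path: str = field(metadata={"safe": False})    # customer-only detail

def export_view(record) -> dict:
    # Project a record down to its safe fields for export.
    return {
        f.name: getattr(record, f.name)
        for f in fields(record)
        if f.metadata.get("safe", False)
    }

rec = RequestLog(1718000000000, 200, 12.5, "cust-42", "/data/invoices/q3.csv")
print(export_view(rec))
# {'timestamp_ms': 1718000000000, 'status_code': 200, 'duration_ms': 12.5}
```

Swap the boolean for an ACL level in the metadata and you get the richer structure mentioned above; and since each field is a typed column, a columnar encoder can compress it optimally.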

I applaud the industry for trying so hard for so long to make everything backwards compatible with the unstructured bytes base case, but I'm not sure that's ever really been the right north star.

Grand solutions require broad coordination, and they often devolve back into a modified-but-equivalent version of the previous problem. :(

Stream-of-bytes is a classically difficult model to escape. Many have tried.

Yeah. There are good reasons things are bad. But there's also a foolish consistency. Like, you can just do things! If you decide monitoring is important you can decide not to outsource it. Most everyone doesn't, though. Probably because they don't think it's very important, and the existing tools get it done well enough, and it's the muscle memory of the subjectively familiar (if objectively fantastically overpriced).

Well, even in the early days of infrastructure growth, when designing bespoke monitoring systems and protocols would be relatively low-cost, it's still nowhere near the highest-ROI way to spend your tech team's time and energy.

And to do it right (i.e. with a low risk of having it blow up with negative effects on the larger business goals), you need someone fairly experienced, or maybe even specialized, in that area. If you have that person, they're on the team because of their other skills, which you need more urgently.

SaaS, COTS, and open-source monitoring tools have to cater to their existing customers. The sales pitch is "easy to integrate". So even they aren't incentivized to build something new.

It boils down to the fact that stream-of-bytes is extremely well-understood, and almost always good enough. Infinitely flexible, low-ceremony, no patents, and comes preinstalled on everything (emitters and consumers). It's like HTTP in that way.

And the evolution is similar too. It'll always be stream-of-bytes, but you can emit in JSON or protobuf etc, if it's worth the cognitive overhead to do so. All the hyperscalers do this, even when the original emitter (web servers, etc) is just blindly spewing atrocious CLF/quirky-SSV text.
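
For instance, the upgrade path is usually just JSON Lines over the same byte stream -- a toy emitter to show the shape of it (hypothetical helper, not any particular library):

```python
import json
import sys
import time

# The transport is still a plain stream of bytes (stdout here), but each
# line is now a self-describing JSON record instead of free-form text.
def log(event: str, **fields) -> None:
    record = {"ts": time.time(), "event": event, **fields}
    sys.stdout.write(json.dumps(record) + "\n")

log("request_done", status=200, duration_ms=12.5, path="/healthz")
# {"ts": 1718000000.0, "event": "request_done", "status": 200, ...}
```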

> It'll always be stream-of-bytes, but you can emit in JSON or protobuf etc, if it's worth the cognitive overhead to do so.

This is the crux of it. That's great until you encounter a need for a schema, and then it's "schema-on-read" or some similar abomination. And the need might not manifest until you're pushing like 1TB/day or more of telemetry data with hundreds or thousands of engineers working on some >1MLoC monstrosity. Hard to dig out of that hole.

The situation is tragically optimal--we've achieved some kind of multiobjective local maximum on a rock in the sewer at the bottom of a picturesque alpine valley and declared victory. We should do better.

Or maybe I'm overly optimistic.

> The situation is tragically optimal--we've achieved some kind of multiobjective local maximum on a rock in the sewer at the bottom of a picturesque alpine valley and declared victory. We should do better.

But it's a very comfortable rock. Pointy in all the right places.

til it ain't