The networking stuff seems....odd.
'Networking was a substantial cost and required experimentation. We did not use DHCP as most enterprise switches don’t support it and we wanted public IPs for the nodes for convenient and performant access from our servers. While this is an area where we would have saved time with a cloud solution, we had our networking up within days and kinks ironed out within ~3 weeks.'
Where does the switch choice come into whether you DHCP? Wth would you want public IPs.
It really feels like they wanted 30 PB of storage accessible over HTTP and literally nothing else. No redundancy, no NAT, dead simple nginx config + some code to track where to find which file on the filesystem. I like that.
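That architecture can be tiny. A sketch in Python of the "code to track where to find which file" piece, assuming deterministic placement by hashing (node names and path layout are made up for illustration, not theirs):

```python
import hashlib

# Hypothetical node list; in practice this would come from inventory.
NODES = [f"storage{i:02d}.example.internal" for i in range(10)]

def node_for(key: str) -> str:
    """Pick a node by hashing the file key (simple mod placement)."""
    digest = hashlib.sha256(key.encode()).digest()
    return NODES[int.from_bytes(digest[:8], "big") % len(NODES)]

def url_for(key: str) -> str:
    """Build the URL that nginx on the owning node would serve."""
    return f"http://{node_for(key)}/data/{key}"
```

With deterministic placement like this there's no central database to keep consistent; any client that knows the node list can compute where a file lives. The trade-off is that adding or removing nodes remaps most keys, which is where schemes like rendezvous or consistent hashing come in.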
This was not written by a network person, quite clearly. Hopefully it's just a misunderstanding, otherwise they do need someone with literally any clue about networks.
yeah misunderstanding we'll update the post-- separately it's true that we aren't network specialists and the network wrangling was prob disproportionately hard for us/ shouldn't have taken so long.
Massive props for getting it done anyway. For others reading: in general a switch should never run dhcpd itself, but it will normally/often relay DHCP for you; your Aristas would 100% have supported relaying, though in this case it sounds like it might even be a flat L2 network. Normally you'd host dhcpd on a server.
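On Arista EOS, relaying is typically a one-liner per L3 interface; a sketch with made-up addresses, pointing at a dhcpd host at 10.0.0.5:

```
! Relay DHCP broadcasts from the server VLAN to a central dhcpd
interface Vlan10
   ip address 10.0.10.1/24
   ip helper-address 10.0.0.5
```

The switch just forwards the broadcast as unicast to the helper address; the actual lease logic all lives on the server.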
Some general feedback in case it's helpful. -$20K on contractors seems insane if we're talking about rack and stack for 10 racks. Many datacentres can be persuaded to do it for free as part of your agreeing to sign their contract. Your contractors should at least be using a server lift of some kind, again often provided kindly by the facility. If this included paying for server configuration and so on, then ignore that comment (bargain!).
-I would almost never expect to actually pay a setup fee (beyond something nominal like 500 per rack) to the datacentre either, certainly if you're going to be paying that fee it had better include rack and stack.
-A crash cart should not be used for an install of this size; the servers should be plugged into the network and then automatically configured by a script/iPXE. It might sound intimidating or hard but it's not, and it doesn't even require IPMI (though frankly I would strongly, strongly recommend IPMI if you don't already have it). I would use managed switches for the management network too, for sure.
-Consider two switches, especially if they are second hand. The cost of the cluster not being usable for a few days while you source and install a replacement even here probably is still thousands.
-Personally I'm not a big fan of the whole JBOD architecture and would have just filled my boots with single-socket 4U Supermicro chassis. To each their own, but JBOD's main benefit is a very small financial saving at the cost of quite a lot of drawbacks IMO. YMMV.
-Depending on who you use for GPUs, getting a private link or 'peering' to them might save you some cost and provide higher capacity.
-I'm kind of shocked that FMT2 didn't turn out much cheaper than your current colo, would expect less than those figures possibly with the 100G DIA included (normally about $3000/month no setup).
def agree on the setup fees, that was just a price crunch to get it done within the weekend. (too short-notice for professional services, too sensitive for craigslist, so basically just paying a bunch of folks we already knew and trusted)
for IPXE do you have any reference material you'd recommend? we had 3 people each with reasonably substantial server experience try for like 6 hours each and for whatever reason it turned out to be too difficult.
I have done a ton of iPXE boot setups in the past. We use iPXE at our DC location for imaging, system recovery, etc. In fact, I just finished up a new boot image that creates a 100MB virtual floppy drive used for BIOS updates. Reach out and I can provide the entire setup if you like (pxe config files, boot loaders, scripts, etc).
Similarly I'm happy to share my ipxe scripts. It's just one of those things that you need to understand the fundamentals of before you start. It's about a hundred lines of bash to setup.
I assume it was their first time setting up iPXE? There are a lot of hangnails with it depending on the infra you're using it in.
For 10 racks it might not make sense.
Honestly, with 10 servers, a pxe setup is probably overkill. If you're getting used servers (and maybe even if not), you might need to poke them with a KVM to set the boot options so that PXE is an option, and you might want to configure the BMC/IPMI from the console too, and then configure anything for serial over IPMI / bios console on serial ports... do that in your office, since your colo is across the street, and then you may as well do the install too. Then when you install, it should just work and crash cart if not. But, PXE is fun, so...
For PXE / iPXE, there's several stages of boot. You have your NIC's option rom, which might be, but probably is not iPXE. That will hit DHCP to get its own IP and also request info about where to pull boot files. You'll need to give it a tftp server IP and a filename. DHCPD config below
I serve iPXE executables to non-iPXE clients. When iPXE starts up, it again asks DHCP, but now you can give it an HTTP boot script. The simplest thing is to have something like
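a minimal boot.ipxe along these lines (server name and paths illustrative):

```
#!ipxe
# Pull a kernel and initrd over HTTP and boot the installer
kernel http://boot.example.internal/ubuntu/vmlinuz initrd=initrd.img autoinstall
initrd http://boot.example.internal/ubuntu/initrd.img
boot
```

Since it's fetched over HTTP on every boot, you can edit it on the server and every machine picks up the change on its next PXE boot.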
You can also boot ISOs, but that's a lot easier if you're in BIOS boot rather than UEFI. Better to practice booting kernels and initrds (unless you need to boot things like firmware update ISOs). Then you'll have your installer (or whatever) booted, and you might have an unattended install setup for that, or you can just set up a rescue image that does DHCP (again!) and opens sshd so you can shell in and do whatever. Up to you.
the pxe part of my isc dhcpd config is:
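something along these lines, following the ipxe.org howto at [1], with addresses swapped for illustration:

```
option architecture-type code 93 = unsigned integer 16;

subnet 10.0.10.0 netmask 255.255.255.0 {
  range 10.0.10.100 10.0.10.200;
  next-server 10.0.10.2;               # tftp server
  if exists user-class and option user-class = "iPXE" {
    # already running iPXE: hand it the HTTP boot script
    filename "http://10.0.10.2/boot.ipxe";
  } elsif option architecture-type = 00:07 {
    filename "ipxe.efi";               # UEFI clients chainload iPXE
  } else {
    filename "undionly.kpxe";          # BIOS clients chainload iPXE
  }
}
```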
(This is mostly consolidating bits and pieces from here [1].) And I have those three files in the root of my tftp server. There's all sorts of other stuff you could do, but this should get you started. You don't really need iPXE either, but it's a lot more flexible if you need anything more, and it can load from HTTP, which is gobs faster if you have large payloads.
If you really wanted to be highly automated, your image could be fully automated, pull in config from some system and reconfigure the BMC while it was there. But there's no need for that unless you've got tons of servers. Might be something to consider if you mass replace your disk shelves with 4U disk servers, although it might not save a ton of time. If you're super fancy, your colo network would have different vlans and one of them would be the pxe setup vlan --- new servers/servers needed reimaging could be put into the pxe vlan and the setup script could move them into the prod vlan when they're done. That's fun work, but not really needed, IMHO. Semi-automated setup scales a lot farther than people realize, couple hundred servers at least. autopw [2] can help a lot!
[1] https://ipxe.org/howto/dhcpd
[2] https://github.com/jschauma/sshscan/blob/master/src/autopw
I assume your actual training is being done somewhere else? Did you try getting colocation space in the same datacentre as somewhere with the compute - it would have reduced your internet costs even further.
yeah the cost calculus is very different for gpus, it absolutely makes sense for us to be using cloud there. also hardly any datacenters can support the power density, esp in downtown sf
Yeh; one other thing - you list a separate management network as optional - it's not optional! Under no circumstances should you expose the management IPs of switches or the servers to the internet; they are, on average, about as secure as a drunk politician. Use a separate management net, and make sure it's only securely accessed.
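Even a minimal policy on whatever box does your routing goes a long way; a sketch in nftables with made-up addresses, dropping everything bound for the management subnet except traffic from one jump host:

```
table inet mgmt {
  chain forward {
    type filter hook forward priority 0; policy accept;
    # 10.99.0.0/24 = management net, 10.0.0.10 = jump host (illustrative)
    ip daddr 10.99.0.0/24 ip saddr != 10.0.0.10 drop
  }
}
```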
I understood that it's optional because they can walk down the road to the data center instead.
They mention plugging monitors in several times. I think I've only done that once in the last couple of years, when a firmware upgrade failed and reset the management interface IP.
yep this. we just turned off management
> Wth would you want public IPs.
So anyone can download 30 PB of data with ease of course.
They didn't seem to want to use a router. Purpose-built 100 Gbps routers are a bit expensive, but you can also turn a computer into one.
Many switches are L3 capable, making them in effect a router. Considering their internet lines appear to be hooked up to their 100 Gbps switch, I'd guess this is one of the L3 ones.
> Wth would you want public IPs.
Possibly to avoid needing NAT (or VPN) gateway that can handle 100Gbps.
No DHCP doesn't mean public IPs nor impact the need for NAT, it just means the hosts have to be explicitly configured with IP addresses, default gateways if they need egress, and DNS.
Those IPs you end up assigning manually could be private ones or routable ones. If private, authorized traffic could be bridged onto the network by anything, such as a random computer with 2 NICs, one of which is connected eventually to the Internet and one of which is on the local network.
If public, a firewall can control access just as well as using NAT can.
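In either case the per-host config is tiny; a netplan sketch for Ubuntu with illustrative addresses (public here, but a 10.x block works the same way):

```yaml
# /etc/netplan/01-static.yaml
network:
  version: 2
  ethernets:
    enp1s0f0:
      addresses: [203.0.113.10/24]
      routes:
        - to: default
          via: 203.0.113.1
      nameservers:
        addresses: [9.9.9.9, 1.1.1.1]
```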
I know, I was specifically answering the question of "why the hell would you want public IPs".
I don't know why their network setup wouldn't support DHCP, that's extremely common especially in "enterprise" switches via DHCP forwarding.
Ok then yes I agree with you. That was weird
I don't know what they're doing, but Mikrotik can perhaps route that → https://mikrotik.com/product/ccr2216_1g_12xs_2xq#fndtn-testr... and is about the cost of their used thing.
And I think this would be a banger for IPv6 if they really "need" public IPs.
Exactly what I came in to say, CCR2216 can do this for < $2k, and does it well.
I mean, generally above a certain size of deployment DHCP is much more trouble than it's worth.
DHCP is really only worth it when your hosts are truly dynamic (i.e. not controlled by you). Otherwise it's a lot easier to handle IP allocation as part of the asset lifecycle process.
Heck even my house IoT network is all static IPs because at the small scale it's much more robust to not depend on my home router for address assignment - replacing a smart bulb is a big enough event, so DHCP is solely for bootstrapping in that case.
At the enterprise level unpacking a server and recording the asset IDs etc is the time to assign IP addresses.
I have static, public IPs across 80 or so servers.
It gets set approximately once when the server's automated Ubuntu installation runs, and I never think about it.
> Where does the switch choice come into whether you DHCP?
Perhaps from home routers, which include one.
> Wth would you want public IPs.
Why wouldn't you? They have a firewall.