Is it ok if I ask you for your age?
Is there an expanded explanation coming soon for "Of course it is biased! There’s no way to train the network otherwise!"?
I'm still struggling to understand why that is the case. As far as I understand the training, in a bad case (probably mostly at the start) you could happen to learn the wrong gate early and then have to revert from it. Why doesn't the same thing happen without the bias toward pass-thru? I get why pass-thru would make things faster, but not why its absence would prevent convergence.
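To make sure I'm picturing the setup right, here's a minimal sketch of how I understand the relaxed gate (my own illustration, not the author's code): each gate holds a softmax over the 16 two-input boolean ops, evaluated on real-valued inputs, and the pass-thru bias just means the pass-through op starts with a much larger logit, so every gate initially behaves like a wire. The op ordering and numbers below are made up for illustration.

```c
#include <math.h>
#include <stdio.h>

/* One relaxed logic gate, as I understand the DLGN idea: the output is a
 * softmax-weighted mix of real-valued relaxations of all 16 two-input
 * boolean functions. "Biasing to pass-thru" = the pass-through op gets a
 * much larger initial logit. The op ordering here is just illustrative. */

#define N_OPS  16
#define PASS_A 3   /* index of the "output = a" op in op() below */

static double op(int i, double a, double b) {
    switch (i) {
        case 0:  return 0.0;                     /* FALSE       */
        case 1:  return a * b;                   /* AND         */
        case 2:  return a - a * b;               /* a AND NOT b */
        case 3:  return a;                       /* pass-thru a */
        case 4:  return b - a * b;               /* b AND NOT a */
        case 5:  return b;                       /* pass-thru b */
        case 6:  return a + b - 2 * a * b;       /* XOR         */
        case 7:  return a + b - a * b;           /* OR          */
        case 8:  return 1 - (a + b - a * b);     /* NOR         */
        case 9:  return 1 - (a + b - 2 * a * b); /* XNOR        */
        case 10: return 1 - b;                   /* NOT b       */
        case 11: return 1 - b + a * b;           /* a OR NOT b  */
        case 12: return 1 - a;                   /* NOT a       */
        case 13: return 1 - a + a * b;           /* b OR NOT a  */
        case 14: return 1 - a * b;               /* NAND        */
        default: return 1.0;                     /* TRUE        */
    }
}

/* Softmax over the 16 logits, then mix the op outputs. */
static double soft_gate(const double logits[N_OPS], double a, double b) {
    double z = 0.0, out = 0.0;
    for (int i = 0; i < N_OPS; i++) z += exp(logits[i]);
    for (int i = 0; i < N_OPS; i++) out += exp(logits[i]) / z * op(i, a, b);
    return out;
}

int main(void) {
    double logits[N_OPS] = {0};
    logits[PASS_A] = 5.0;  /* the pass-thru bias */
    /* With the bias the output tracks a (about 0.86 here); with all-zero
     * logits it is exactly 0.5 for any input, since the 16 ops pair up
     * into complements. */
    printf("biased gate(0.9, 0.1) = %f\n", soft_gate(logits, 0.9, 0.1));
    return 0;
}
```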
That part about passthrough strongly reminded me of Turing’s Unorganized Machines (randomly wired NAND-gate networks): https://weightagnostic.github.io/papers/turing1948.pdf (worth a read from page 9)
Thank you for the excellent writeup of some extremely interesting work! Do you have any opinions on whether binary networks and/or differentiable circuits will play a large role in the future of AI? I've long had this hunch that we'll look back on current dense vector representations as an inferior way of encoding information.
Thank you, I'm glad you enjoyed it!
Well, I'm not an expert. I think that this research direction is very cool. I think that, at the limit, for some (but not all!) applications, we'll be training over the raw instructions available to the hardware, or perhaps even the hardware itself. Maybe something as in this short story[0]:
> A descendant of AutoML-Zero, “HQU” starts with raw GPU primitives like matrix multiplication, and it directly outputs binary blobs. These blobs are then executed in a wide family of simulated games, each randomized, and the HQU outer loop evolved to increase reward.
I also think that different applications will require different architectures and tools, much like how you don't write systems software in Lua, nor script game mods with Zsh. It's fun to speculate, but who knows.
[0]: https://gwern.net/fiction/clippy
How do the ~300 gates you got compare to modern optimal implementations?
iirc it's around 30-40?
Was this result surprising?
Yes and no. I wasn't expecting to be able to reproduce the work, so I'm just content that it works. I was very surprised by how much hyperparameter finagling I had to do to get the DLGN converging; the tiny ReLU network I trained at the beginning, in comparison, converged with dead-simple SGD in a third of the epochs.
The speedup was surprising in the sense that the bit-level parallelism fell out naturally: that 64× speedup alone was unexpected and pretty sweet. There's likely still a lot of speed left on the table. I just did the bare minimum to get the C code working: it's single-threaded, there's no vectorization, lots of register spilling, etc. Imagine the speedup you'd get running the circuit on e.g. an FPGA.
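To make the bit-packing concrete, the shape of the lowered code is roughly this (a hand-written sketch with made-up gates, not the actual generated ~300-gate circuit): every wire is a uint64_t, so a single bitwise instruction evaluates its gate for 64 independent samples at once.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the bit-level parallelism: every wire is a uint64_t, bit i of
 * each word belongs to sample i, so each bitwise op evaluates its gate for
 * 64 samples in one instruction. The gates here are illustrative only. */
static uint64_t tiny_circuit(uint64_t in0, uint64_t in1, uint64_t in2) {
    uint64_t w0 = in0 & in1;   /* gate 0: AND */
    uint64_t w1 = w0 ^ in2;    /* gate 1: XOR */
    uint64_t w2 = ~(w1 | in0); /* gate 2: NOR */
    return w1 & ~w2;           /* output gate */
}

int main(void) {
    /* One call = 64 evaluations of the circuit. */
    uint64_t out = tiny_circuit(0xAAAAAAAAAAAAAAAAull,
                                0xCCCCCCCCCCCCCCCCull,
                                0xF0F0F0F0F0F0F0F0ull);
    printf("64 outputs, one per bit: %016llx\n", (unsigned long long)out);
    return 0;
}
```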
But no, it was not surprising in the sense that yeah, multiplying billions of floats is going to be much slower than a handful of parallel bitwise ops. Physics is physics, doesn't matter how good your optimizer is.
What percentage of ops were passthru?
ps. superb writeup and project
Thank you! Good question. Here are the NN stats, before lowering to C:
Note the rows aren't mutually exclusive.
You've made some mistakes with the Game of Life rules. You've missed out the overpopulation rule:
> Any live cell with more than three live neighbours dies.
Nit:
> I guess there’s a harsh third rule which is, “if the cell is dead, it stays dead”.
That phrasing is inaccurate: if a dead cell stayed dead, the first rule wouldn't work. I'm not sure that particular sentence adds much to the flow, honestly.
You're thinking about the cells as toggles on a stateful grid; TFA is thinking about them as pure functions that take in an input state and output a new state (with "off" being the default).
From that perspective, there's no point in "killing" a cell; it's simpler to only write out the 0 -> 1 and 1 -> 1 transition cases and leave all of the other cases as implicitly 0.
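Concretely, something like this (my own sketch, not TFA's code): only the birth and survival cases are stated, and everything else, overpopulation included, falls out as the default 0.

```c
#include <assert.h>
#include <stdint.h>

/* The "pure function" view: the next state depends only on the current
 * state and the live-neighbour count. Only the 0 -> 1 (birth) and
 * 1 -> 1 (survival) cases are written out; everything else, including
 * overpopulation and "dead stays dead", falls through to 0. */
static uint8_t next_cell(uint8_t alive, int neighbours) {
    if (!alive && neighbours == 3) return 1;                     /* birth    */
    if (alive && (neighbours == 2 || neighbours == 3)) return 1; /* survival */
    return 0;
}

int main(void) {
    assert(next_cell(0, 3) == 1); /* birth */
    assert(next_cell(1, 2) == 1); /* survival */
    assert(next_cell(1, 4) == 0); /* overpopulation: implicitly 0 */
    assert(next_cell(0, 1) == 0); /* dead stays dead: implicitly 0 */
    return 0;
}
```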