Differentiable Logic Gate Networks [0] are super interesting. However, I still don't like that the wiring is fixed initially rather than learned.

I did some extremely rough research into doing learnable wirings [1], but couldn't get past even learning ~4-bit addition.

[0]: https://arxiv.org/abs/2210.08277

[1]: https://ezb.io/thoughts/program_synthesis/boolean_circuits/2...

One of the easier solutions that does the rounds periodically (in many forms above and beyond logic gates, such as symbolic regression) is just densely connecting everything and ensuring that an identity function exists. Anneal a penalty around non-identity nodes, use L1 for the penalty, and you can learn a sparse representation.

There are a number of details to work through, such as how to make an "identity" for gates with 2 inputs and 1 output (just don't offer those; use 2-in/2-out gates like a half adder instead of AND or XOR, and add a post-processing step that removes the extra wires you don't care about), or how to define "densely connected" in a way that doesn't explode combinatorially (many solutions; the details only matter a little). But it's the brute-force solution, and you only pay the cost during training.
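
For concreteness, here's a rough sketch of that idea as I read it (my own illustration, not a reference implementation): a dense wiring matrix initialized at the identity, with an L1 penalty on everything off the identity that you anneal upward during training, then threshold to a hard wiring at the end.

    import torch
    import torch.nn as nn

    class SoftWiring(nn.Module):
        """Dense, learnable wiring that can always represent "do nothing"."""
        def __init__(self, n):
            super().__init__()
            self.w = nn.Parameter(torch.eye(n))   # start at the identity

        def forward(self, x):                     # x: (batch, n)
            return x @ self.w

        def sparsity_penalty(self):
            # L1 on everything except the identity pattern; the training loop
            # multiplies this by a coefficient that is annealed upward.
            eye = torch.eye(self.w.shape[0], device=self.w.device)
            return (self.w - eye).abs().sum()

    # loss = task_loss + lam(step) * wiring.sparsity_penalty()  # lam ramps up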

There are lots of other fun ways to handle that problem though. One of my favorites is to represent your circuit as "fields" rather than discrete nodes. Choose your favorite representation for R2->Rn (could be a stack of grids, could be a neural net, who cares), and you conceptually represent the problem as a plane of wire density, a plane of "AND" density, a plane of "XOR" density, etc. Hook up the boundary conditions (inputs and outputs on the left and right side of the planes) and run your favorite differentiable PDE solver, annealing the discreteness of the wires and gates during training.

Ha! I have spent the last 2 years on this idea as a pet research project and have recently found a way of learning the wiring in a scalable fashion (arbitrary number of input bits, arbitrary number of output bits). Would love to chat with someone also obsessed with this idea.

I'm also very interested. I played around a lot with Differentiable Logic Networks a couple of months ago, looking at how to make the learned wiring scale to a bigger number of gates. I had a couple of ideas that seemed to work at a smaller scale, but they had trouble converging with deeper networks.

Also very interested. Do you have any code on github?

I think the techniques in “Weight Agnostic Neural Networks” should be applicable here, too. It uses a variant of NEAT, I believe. This would allow for learning the topology and wiring rather than just the gates. But in practice it is probably pretty slow, and may not be all that different from a pruned and optimized DLGN.

https://weightagnostic.github.io/

To ruin it for everyone: They're also patented :) https://patents.google.com/patent/WO2023143707A1/en?inventor...

What's the innovation here?

Using logic operators? Picking something from a range of options with SoftMax? Having a distribution to pick from?

I remember reading about adaptive boolean logic networks in the 90's. I remember a paper about them using the phrase "Just say no to backpropagation". It probably goes back considerably earlier.

Fuzzy logic was all the rage in the 90's too. Almost at the level of marketers sticking the label on everything the way AI is done today. Most of that was just 'may contain traces of stochasticity' but the academic field used actual defined logical operators for interpolated values from zero to one.

A quick look on picking from a selection found https://psycnet.apa.org/record/1960-03588-000 but these days softmax is just about ubiquitous.

> What's the innovation here?
> Having a distribution to pick from?

As I understand it, it's exactly this. Specifically, representing each neuron in the network as a probability distribution over logic gates, training that distribution via gradient descent, and then collapsing each neuron to its most probable gate. The author has a few more details in their thesis:

https://arxiv.org/abs/2209.00616
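
As a toy illustration of that idea (mine, not the author's code, and with only 4 of the 16 possible two-input gates): each neuron carries learnable logits over candidate gates, evaluated with relaxed real-valued gate semantics, and after training you collapse each neuron to its argmax gate.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Relaxed (probabilistic) versions of a few two-input gates.
    GATES = [
        lambda a, b: a * b,              # AND
        lambda a, b: a + b - a * b,      # OR
        lambda a, b: a + b - 2 * a * b,  # XOR
        lambda a, b: a,                  # pass-through
    ]

    class DiffLogicNeuron(nn.Module):
        def __init__(self):
            super().__init__()
            self.logits = nn.Parameter(torch.zeros(len(GATES)))

        def forward(self, a, b):
            w = F.softmax(self.logits, dim=0)
            # Expected output under the gate distribution; at inference time
            # this gets hardened to the single most probable gate.
            return sum(wi * g(a, b) for wi, g in zip(w, GATES))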

Specifically it's the training approach that's patented. I'm glad to see that people are trying to improve on his method, so the patent will likely become irrelevant in the future as better methods emerge.

The author also published an approach applying their idea to convolutional kernels in CNNs:

https://arxiv.org/abs/2411.04732

In the paper they promise to update their difflogic library with the resulting code, but it seems they have conveniently forgotten to do this.

I also think their patent is too broad, but I guess it speaks well of the ML community that we haven't seen more patents in this area. I could also imagine that, given that the approach promises some very impressive performance improvements, they're somewhat afraid it will be used for embedded military applications.

Liquid NNs are also able to generate decision trees.

My Zojirushi rice cooker says fuzzy logic on it, it's 15 years old, so that phrase was still marketed 15 years after "inception".

If you replace the uint64_t cell with an __attribute__((vector_size(32))) vector type and build with -march=native, the bitwise ops will work exactly as before, but you'll light up the vector units on an x86-64 machine.

Good blog post, thanks!

Glad you enjoyed it, and thanks for the tip!

The interesting thing here is that it's not a straightforward port. JAX is already very fast, for the architecture it implements. The point is that the network is heavily contracted by removing nodes that only do pass-through, and then hugely parallelizing the computations using bitwise operations on 64 bits at once. Hence this incredible speedup.
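
For anyone who hasn't seen the trick, here's the bit-level parallelism in a few lines of numpy (illustrative only; the post emits plain C): each 64-bit word holds one wire's value across 64 independent evaluations, so a single bitwise op advances 64 "circuits" at once.

    import numpy as np

    # Bit i of each word is that wire's value in evaluation i.
    a = np.uint64(0x0123456789ABCDEF)
    b = np.uint64(0xFEDCBA9876543210)

    nand = ~(a & b)                                  # 64 NAND gates at once
    xor = a ^ b                                      # 64 XOR gates at once
    result_5 = (xor >> np.uint64(5)) & np.uint64(1)  # read evaluation #5's bit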

How much of a speedup is the C compiler optimization able to achieve in terms of compiling it down to a hand-written C equivalent vs the -O0 non-optimized assembler? What does the optimized C/assembler do which isn't actually necessary and accounts for the remaining inefficiency?

There are 163 lines of C. Of them, with -O3, 104 lines are present in the assembly output. So the C compiler is able to eliminate an additional ~36.2% of the instructions. It doesn't do anything fancy, like autovectorization.

I profiled just now:

    |     | instrs (aarch64) | time 100k (s) | conway samples (%) |
    | -O0 |              606 |         19.10 |             78.50% |
    | -O3 |              135 |          3.45 |             90.52% |
The 3.45s surprises me, because it's faster than the 4.09s I measured earlier. Maybe I had a P core vs an E core. For -O0, the compiler is emitting machine code like:

    0000000100002d6c ldr x8, [sp, #0x4a0]
    0000000100002d70 ldr x9, [sp, #0x488]
    0000000100002d74 orn x8, x8, x9
    0000000100002d78 str x8, [sp, #0x470]
Which is comically bad. If I try with e.g. -Og, I get the same disassembly as -O3. Even -O1 gives me the same disassembly as -O3. The assembly (-Og, -O1, -O3) looks like a pretty direct translation of the C. Better, but also nothing crazy (e.g. no autovectorization):

    0000000100003744 orr x3, x3, x10
    0000000100003748 orn x1, x1, x9
    000000010000374c and x1, x3, x1
    0000000100003750 orr x3, x8, x17
Looking more closely, there's actually surprisingly little register spilling.

I think the real question you're asking is, as I wrote:

> If we assume instruction latency is 1 cycle, we should expect 2,590 fps. But we measure a number nearly 10× higher! What gives?

Part of this is due to counting the instructions in the disassembly wrong. In the blog post I used 349 instructions, going off Godbolt, but in reality it's 135. If I redo the calculations with these new numbers, I get 2.11 instructions per bit, 0.553 million instrs per step, and dividing out 3.70 Gcycles/s gives 6,690 fps. Which is better than 2,590 fps, but still 3.6x slower than 24,400. But I think 3.6x is a factor you can chalk up to instruction-level parallelism.
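
For anyone following along, the redone arithmetic looks like this (assuming the 512x512 grid that the 0.553M-instrs-per-step figure implies):

    instrs_per_word = 135                        # -O3 count, one 64-cell step
    instrs_per_bit = instrs_per_word / 64        # ~2.11
    cells = 512 * 512
    instrs_per_step = cells * instrs_per_bit     # ~0.553 million
    fps_at_1_ipc = 3.70e9 / instrs_per_step      # ~6,690 fps at 1 instr/cycle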

Hope that answers your questions. Love your writing Gwern.

Thanks for checking. It sounds like the C compiler isn't doing a great job here of 'seeing through' the logic gate operations and compiling them down to something closer to optimal machine code. Maybe this is an example of how C isn't necessarily great for numerical optimization, or the C compiler is just bailing out of analysis before it can fix it all up.

A full-strength symbolic optimization framework like an SMT solver might be able to boil the logic gates down into something truly optimal, which would then be a very interesting proof of concept to certain people, but I expect that might be an entire project in its own right for you and not something you could quickly check.

Still, something to keep in mind: there's an interesting neurosymbolic research direction here in training logic gates to try to extract learned 'lottery tickets' which can then be turned into hyper-optimized symbolic code achieving the same task performance but possibly far more energy-efficient or formally verifiable.

Something like this should be hitting the instruction-level vectoriser (the basic-block-at-a-time one) nearly bang on. It's a lot of the same arithmetic ops interleaved. It might be a good test case for LLVM - I would have expected almost entirely vector instructions from this.

z3 has good Python bindings, which I've messed around with before. My manual solution uses 42 gates; I would be interested to see how close to optimal it is. I didn't ask the compiler to vectorize anything; doing that explicitly might yield a better speedup.
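
For a taste of the z3 Python bindings, here's a tiny made-up example of the kind of check involved: proving that a rewritten carry expression is equivalent to the textbook majority form, which is what you'd want before trusting any "boiled down" circuit.

    from z3 import Bools, And, Or, Xor, Solver, sat

    a, b, c = Bools("a b c")
    carry_ref = Or(And(a, b), And(b, c), And(a, c))  # textbook majority
    carry_opt = Or(And(a, b), And(c, Xor(a, b)))     # rewritten, fewer gates

    s = Solver()
    s.add(carry_ref != carry_opt)  # search for any input where they disagree
    print("equivalent" if s.check() != sat else "NOT equivalent")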

Re:neurosymbolics, I'm sympathetic to wake-sleep program synthesis and that branch of research; in a draft of this blog post, I had an aside about the possibility of extracting circuits and reusing them, and another about the possibility of doing student-teacher training to replace stable subnets of standard e.g. dense relu networks with optimized DLGNs during training, to free up parameters for other things.

Relevant post from a few years ago: https://news.ycombinator.com/item?id=25290112

“NN-512 is an open-source Go program that generates fully AVX-512 vectorized, human-readable, stand-alone C implementations of convolutional neural nets”

Well done — really enjoyed this. We could use this kind of optimization in our library[0], which builds differentiable logic networks out of gates like AND, XOR, etc.

It focuses on training circuit-like structures via gradient descent using soft logic semantics. The idea of compiling trained models down to efficient bit-parallel C is exactly the kind of post-training optimization we’ve been exploring — converting soft gates back into hard boolean logic (e.g. by thresholding or symbolic substitution), then emitting optimized code for inference (C, WASM, HDL, etc).

The Game of Life kernel is a great example of where logic-based nets really shine.

[0]: https://github.com/VoxLeone/SpinStep/tree/main/benchmark

I also worked, a long time ago, on recreating the original Deep Differentiable Logic Gate Networks paper [1], so I have a couple of additions to make.

> I wanted to see if I could learn the wires in addition to the gates. I still think it’s possible, but it’s something I had to abandon to get the model to converge.

Actually, I read some other paper where they also learned the wiring, but they did so by alternating the training of the gates and the wires (in some iterations they learned the wiring while keeping the gates frozen, and in others they learned the gates while keeping the wiring frozen). The problem with this approach is that it doesn't scale: you need a lot of gates to approximate the behavior of a simple MLP, and if you need a full NxM learned matrix to encode the wiring, the memory needed to learn, for example, MNIST gets huge quickly. I think there are two fixes for this:

- You actually don't need to learn a full NxM matrix to increase the expressivity of the network. You can, for each output gate, select a random subset of possible input gates of size K, and then you only need a learned matrix of size KxM (see the sketch after this list). I did the numbers, and even a moderately small K, like 16 or 32, wildly increases the number of circuits you can learn with a smaller number of layers and gates.

- You could use a LoRA-style factorization. Instead of an NxM matrix, use a pair of matrices NxK and KxM, where K << N, M.
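
Here's the promised sketch of the first fix (my own illustration, not from any paper): each of the M output positions gets K fixed random candidate inputs and learns a softmax over just those, so the learned wiring table is KxM instead of NxM.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LearnableWiring(nn.Module):
        def __init__(self, n_in, n_out, k=16):
            super().__init__()
            # Fixed random candidate inputs for each output position.
            self.register_buffer("cand", torch.randint(n_in, (n_out, k)))
            # Learned K x M selection weights.
            self.logits = nn.Parameter(torch.zeros(n_out, k))

        def forward(self, x):                      # x: (batch, n_in)
            w = F.softmax(self.logits, dim=-1)     # (n_out, k)
            gathered = x[:, self.cand]             # (batch, n_out, k)
            return (gathered * w).sum(dim=-1)      # softly selected inputs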

Learning the wiring also has other benefits. Since the learned wiring can swap a gate's inputs if needed, you can remove candidate gates that are "mirrors" or "permutations" of each other (a and not b, not a and b; a or not b, not a or b), which helps scale the networks to gates with more inputs (I tried 3-input and 4-input gates).

Also, as the author pointed out, it was very difficult to get the models to converge. It was very frustrating that I never managed to get a working model that performed really well on MNIST. In the end, I gave up on that and instead worked on making the network consistently learn simple 3-input or 4-input functions with perfect accuracy, and I managed to get it to learn them reliably within a couple dozen iterations, which was nice.

[1] https://arxiv.org/abs/2210.08277

Very cool, thank you for sharing!

Author here. Any questions, ask away.

Is there an expanded explanation coming for "Of course it is biased! There’s no way to train the network otherwise!"?

I'm still struggling to understand why that is the case. As far as I understand the training, in a bad case (probably mostly at the start) you could happen to learn the wrong gate early and then have to revert from it. Why doesn't the same thing happen without the bias toward pass-through? I get why pass-through would make things converge faster, but not why its absence would prevent converging.

That part about passthrough strongly reminded me of Turing’s Unorganized Machines (randomly wired NAND-gate networks): https://weightagnostic.github.io/papers/turing1948.pdf (worth a read from page 9)

Thank you for the excellent writeup of some extremely interesting work! Do you have any opinions on whether binary networks and/or differentiable circuits will play a large role in the future of AI? I've long had this hunch that we'll look back on current dense vector representations as an inferior way of encoding information.

Thank you, I'm glad you enjoyed it!

Well, I'm not an expert. I think that this research direction is very cool. I think that, at the limit, for some (but not all!) applications, we'll be training over the raw instructions available to the hardware, or perhaps even the hardware itself. Maybe something as in this short story[0]:

> A descendant of AutoML-Zero, “HQU” starts with raw GPU primitives like matrix multiplication, and it directly outputs binary blobs. These blobs are then executed in a wide family of simulated games, each randomized, and the HQU outer loop evolved to increase reward.

I also think that different applications will require different architectures and tools, much like how you don't write systems software in Lua, nor script game mods with Zsh. It's fun to speculate, but who knows.

[0]: https://gwern.net/fiction/clippy

how do the ~300 gates you got compare to modern optimal implementations?

iirc it's around 30-40?

Was this result surprising?

Yes and no. I wasn't expecting to be able to reproduce the work, so I'm just content that it works. I was very surprised by how much hyperparameter finagling I had to do to get the DLGN converging; the tiny relu network I trained at the beginning, in comparison, converged with dead-simple SGD in a third of the epochs.

The speedup was surprising in the sense that the bit-level parallelism fell out naturally: that 64× speedup alone was unexpected and pretty sweet. There's likely still a lot of speed left on the table. I just did the bare minimum to get the C code working: it's single-threaded, there's no vectorization, lots of register spilling, etc. Imagine the speedup you'd get running the circuit on e.g. an FPGA.

But no, it was not surprising in the sense that yeah, multiplying billions of floats is going to be much slower than a handful of parallel bitwise ops. Physics is physics, doesn't matter how good your optimizer is.

what percentage of ops were passthru?

ps. superb writeup and project

Thank you! Good question. Here are the NN stats, before lowering to C:

    total gates        | 2303 | 100.0%
    -------------------+------+-------
    passthrough        | 2134 |  92.7%
    gates w/ no effect | 1476 |  64.1%
Note the rows aren't mutually exclusive.

You've made some mistakes with the Game of Life rules. You've missed out the overpopulation rule:

Any live cell with more than three live neighbours dies

Nit:
> I guess there’s a harsh third rule which is, “if the cell is dead, it stays dead”.

That phrasing is inaccurate: if a dead cell stayed dead, the first rule wouldn't work. I'm not sure that particular sentence adds much to the flow, honestly.

You're thinking about the cells as toggles on a stateful grid, TFA is thinking about them as pure functions that take in an input state and output a new state (with "off" being the default).

From that perspective, there's no point in "killing" a cell, it's simpler to only write out the 0 -> 1 and 1 -> 1 transition cases and leave all of the other cases as implicitly 0
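
In code, the pure-function view is just something like this (my paraphrase, not TFA's):

    def next_state(alive: bool, neighbors: int) -> bool:
        # Only the 0 -> 1 and 1 -> 1 cases are spelled out; every other
        # combination is implicitly "dead".
        return neighbors == 3 or (alive and neighbors == 2)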

> 1,744x speedup

Is that 1744x or 1.7x?

The former; also, there are too many digits of precision for it to be the latter.

That approach is bananas! I had seen the source inspiration paper from Google, but it's neat to see it replicated and extended so shortly after.

+10 respect, thank you <3

I recently read about DLGNs on HN and instantly thought: damn, that's some hot take. But I was too stupid to implement it from the paper. Glad you got it working and documented it! Thanks!

This is very fascinating as a limit case, which always serves as a good example of the bound. I think it highlights that “efficiency isn’t everything”, just like in so many other systems, such as healthcare and justice. In this case we could figure out the activation functions by analysis, which is impossible for problems of higher dimensionality. The magic of AI isn’t in its efficiency, it’s in making things computable that simply aren’t by other means.

Pretty cool writeup. The interesting bits came before what the title indicated, though.

Cool. If you do this with an LLM, someone will pay you a lot of money.

  > I tried something new for the first time, which was to keep a journal during development.
DO THIS!!!

I cannot stress this enough!

If you work in a professional science lab, say, physics, biology, chemistry, you are expected to keep an experiment journal. It provides more help to you than the company too (knowledge dump, liability, etc). I can't tell you how many times some stupid ass seemingly benign comment saved my behind. They're worth their weight in gold.

For ML experiments I use wandb and hydra[0]. Put all your configs into hydra. Be fucking pedantic. You should log your seeds, versions, the date, and I mean everything. It only takes a few extra minutes to set this up but the one time you need it it'll save you hours. Dump all that into wandb AND your model checkpoints. You will forget what that checkpoint corresponds to. Make liberal use of wandb tags and comments (through hydra you can make these cli arguments to automate even if launching from slurm scripts). Turn on wandb's code saving.

Most importantly, use those notebooks wandb gives you. Don't worry if it gets messy. It's an experiment notebook, it'll get messy. You'll get better with experience and as you find your style.

It sounds like a lot of work but it really isn't. You can get this all done in under 20 minutes, and if you write it right you can just copy paste it moving forward (i.e. yeah, make a personal library). I can PROMISE you that one mishap will far outweigh this extra work. You look like a pretentious perfectionist, but really I'm a lazy piece of shit that doesn't want to spend hours or days debugging some stupid mistake I'm too dumb or tired to catch. The extra benefit is that when shit breaks you can spin up some (wandb) sweeps and go do some other thing that's always behind.

(On topic: stop using personal wandb accounts for your work experiments. They're like the best company out there; get your boss to pay. They provide an amazing service and are a delight to work with. I cannot speak highly enough about them. They're not the company you want to mooch from. I've literally seen this happen while working for a top-3-market-cap company that was already paying for seats, and you just needed to send a slack message to one dude... not cool guys... not cool...)

[0] https://hydra.cc/docs/intro/
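
To make that concrete, the skeleton looks roughly like this (the config fields are placeholders, not from any real project):

    # train.py
    import hydra
    import wandb
    from omegaconf import DictConfig, OmegaConf

    @hydra.main(config_path="conf", config_name="config", version_base=None)
    def main(cfg: DictConfig) -> None:
        # Log the *entire* resolved config so every run is reproducible:
        # seeds, versions, tags, notes, everything.
        wandb.init(
            project=cfg.wandb.project,
            tags=list(cfg.wandb.tags),
            notes=cfg.wandb.notes,
            config=OmegaConf.to_container(cfg, resolve=True),
        )
        # ... training loop; also stash the same config dict inside every
        # checkpoint you save, so checkpoints and wandb runs stay matchable.

    if __name__ == "__main__":
        main()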

Same goes for something as simple as setting up a server. You will forget, and if you don’t write it down you’ll have to figure it out again.

Where did I put those configs again? Where did Bob put that script? Fuck, why didn't I write an ansible script. It's never a one off, and it serves as documentation. I'll remember after I make the same mistake next time.

Also, environment modules for the win

https://modules.readthedocs.io/en/latest/

Use nix and you’ll at least be able to deploy the same thing again.

So what happens in nix if a package version is found to have some huge vulnerability? And my app expects that exact version?

Because this is literally what docker, venv, and nix claim to fix, but after getting burned by three other systems I'm not willing to invest the time in getting into nix.

I keep machines powered off that have a working configuration of older AI and other software tools because there is no other way to run them, regardless of the code being available on github.

There are other solutions too; this is a big part of systemd, which also has nspawn and vmspawn to do it more explicitly. But everything has some containerization capabilities, and ideally you'd give your program access to only what it needs. PrivateTmp should always be on.

But if versions are vulnerable you usually want to remove those versions, not put them in containers

> remove those versions, not put them in containers

I don't know how to fix this, but perhaps i can ai it and release something on my github if i manage to cobble something together.

These aren't "services" that anyone has access to, except myself; "clients", UIs, and things like whisper.

IF someone were to pay me, I'd figure it out. I'm friends with maintainers and that isn't my style. archiving is.

to wit, i expend no more energy than necessary maintaining other people's code.

I really do not follow what you are trying to convey here.

If there are vulns, and you are using software from nixpkgs, there are tools to get yourself notified about vulnerable packages.

If you want to run vulnerable software on-demand, you can just boot the machine/vm up when needed? If you want to patch stuff yourself, nix makes it trivial to apply your own patches to already packaged software.

One time my boss asked me to upgrade some servers and was surprised I put all my work in a script in version control. Then there was a second batch of servers.

If your work is tracked in an issue-tracking system, you can put your in-progress thinking notes as comments there (if they don't go in the code or some other artifact).

It helps to have a safe environment, among whomever might access that issue comment history. If people don't feel safe exposing their thought process, then they won't do it, or they'll be stressed by presenting vulnerability, and even modify their problem-solving for appearances.

(I have some more complicated options involving a wiki, but explaining requires too much context. The issue-tracking comments solution is obvious.)

What I try not to do is to introduce new places that important information goes. If you don't rein this in, there will be an explosion of employees plastering your IP all over a bunch of random SaaSes, to be undiscovered or even lost to your company entirely (also, those other SaaS companies and hackers might get more use out of stealing your IP than you do).

Blame -> pr -> issue is a great way to learn a codebase if your team is good about keeping a log of their work, which they generally should be.

Chaotic energy haha, I like it. Thanks for the tips re: keeping a journal, I will do this more in the future. I usually keep development notes, though normally in markdown files scattered across the codebase or in comments, never by date in the README. In the future, I might make JOURNAL.md a standard practice in my projects? re:w&b, I used w&b when it first came out and I liked it but I'm sure it's come a lot further in the time since then. I will have to take a look!

Also lol "pretentious perfectionist" I'm glad to finally have some words to describe my design aesthetic. I like crisp fonts, what can I say.

  > Chaotic energy haha, I like it
My boss says I'm eccentric. I say that's just a nice word for crazy lol

> normally in markdown files scattered across the codebase or in comments

I used to do that too but they didn't end up helping because I could never find them. So I moved back to using a physical book. The wandb reports was the first time I really had something where I felt like I got more out of it than a physical book. Even my iPad just results in a lot of lost stuff and more time trying to figure out why I can't just zoom in on the notes app. I mean what is an iPad even for if it isn't really good for writing?

But the most important part of the process I talked about is the logging of all the parameters and options. Those are the details you tend to lose and go hunting for. So even if you never write a word you'll see huge benefits from this.

  > re:w&b
Wandb's best feature is that you can email them requesting a feature and they'll implement it or help you implement it. It's literally their business model. I love it. I swear, they have a support agent assigned to me (thanks Art! And if wandb sees this, give the man a raise. Just look at what crazy people he has to deal with)

  >  lol "pretentious perfectionist" I'm glad to finally have some words to describe my design aesthetic
To be clear, I'm actually not. Too chaotic lol. Besides, perfectionism doesn't even exist. It's more a question of personal taste and where we draw the line for what is good enough. I wish we'd stop saying "don't let perfectionism get in the way of good" because it assumes there's universal agreement about what good enough is.

Parameters and options, got it. I try to keep all configuration declarative and make building and running as deterministic as possible. Then I can commit whenever I do something interesting, that I can just checkout to revisit.

I think these are the two main headaches with experimenting. No matter what kind of experiment you're doing (computation, physics, chem, bio, whatever)

  - Why the fuck aren't things working
  - Why the fuck are things working
The second is far more frustrating. The goal is to understand and explain why things are the way they are. To find that causal structure, right? So in experimenting, getting things working means you're not even half way done.

So if you are "organized" and flexible, you can quickly test different hypotheses. Is it the seed? The model depth? The activation layers? What?

Without the flexibility it gets too easy to test multiple things simultaneously and lose track. You want to isolate variables as much as possible. Variable interplay throws a wrench into that, so you should make multiple modifications at once to search through configuration space optimally, but how can you do any actual analysis if you don't record this stuff? And I guarantee you'll have some hunch and be like "wait, I did something earlier that would be affected by that!" and you can go check to see if you should narrow down on that thing or not.

The reason experimenting is hard is because it is the little shit that matters. That's why I'm a crazy pretentious "perfectionist". Because I'm lazy and don't have the budgets or time to be exhaustive. So free up your ability so you can quickly launch experiments and spend more time working on your hypotheses, because that task is hard enough. You don't want to do that while also having to be debugging and making big changes to code where you're really just going to accidentally introduce more errors. At least that's what happens to my dumb ass, but I haven't yet met a person that avoids this, so I know I'm not alone.

Converged to something similar after spending 2 days bisecting a repo to reproduce a training run, having to wait 3hrs on each commit before getting conclusive results. I couldn't get myself to use hydra though; it felt like a lot of bloat vs loading a yaml with pydantic.

The problem with keeping a journal is that the distraction of doing so may break the state of flow.

OTOH there are natural breaks in the process of working; writing things down during these works fine. The fidelity is a bit lower, but it's still much better than nothing.

  > the distraction of doing so may break the state of flow.
Sure, but like you said, don't do it when in the state of flow.

Or better, make it part of your flow state. To me, it is part of my flow state, so not a real issue.

I mean whatever works for you. You gotta time manage and I can't manage for you. I'm sure your boss is asking for more writeups than I am and just send them your notes. They don't care whats in it half the time, they just don't know how to figure out if you're working or not and just want something.

Hell, we're on HN on a workday... I can guarantee you aren't in a flow state the whole time and can't be bothered with a few minutes to write some stuff down. I mean you have to eat and go to the bathroom, right?

Agree, it's much better to write up a journal at times when your colleagues would be "compiling" anyway: https://xkcd.com/303

From my two minute skim of the docs, not encouraging that hydra only officially supports up to Python 3.11.

I use it with Python 3.12, and 3.12 only just left its bugfix phase. I haven't tried 3.13, but I would be surprised if there was a break. Most of it works through OmegaConf[0].

Idk why they haven't pushed an update in 2 years, but it hasn't been a problem either. FWIW, they're still updating the repo[1].

[0] https://omegaconf.readthedocs.io

[1] https://github.com/facebookresearch/hydra

OK so I try to do that. But then I'll have some big problem or add too many big features and just give up. (My sleep is nearly nonexistent so I don't have a lot of time for logging things anyway.)

You might be doing it wrong. Make your code more modular. Think about how powerful functions actually are. But also make them simple and self-contained. All those programming books suggest this not because "it's pretty", "good form", or whatever. You do this because you know that things never work out like you expect them to. Only a naive programmer or a literal god thinks they'll get the program right on the first go. Even just the fact that the world changes underneath our feet means it will have to change.

So you write expecting things to change. You write so a change can be added quickly and not break everything else. You write so you don't have to pull apart a bunch of tangled mess. There's a lot of complexity no matter what, so even a little goes a long way.

  > My sleep is nearly nonexistent so I don't have a lot of time for logging things anyway
I'll make a bet.

I'll bet that if you log you'll get more sleep. This is a classic negative feedback loop and is honestly why I started doing this in the first place. A little extra work upfront saves me a lot of work down the line. You need to be concerned with today but that doesn't mean you can ignore tomorrow.

The point of my strategy is in how positive things compound. A little here, a little there, do this for a bit and you got something beautiful while it seems like you did no extra work (because you spread it out)

But the negative effects compound too. They create more work. The less sleep you get the more mistakes you make. Worse, the more subtle hard to catch mistakes you make! You just end up missing sleep chasing down bugs and issues introduced because you wrote while being sleep deprived. We all do this! But we need to recognize it and try to break this cycle as soon as we recognize it happening.

My bet is you are the one creating most of the work that is keeping you up and making you feel over burdened.

My bet is if you take a break you'll actually get more done.

My bet is you're caught in a destructive loop.

I'll make this bet because I have so much experience with this same self-destructive behavior. Been there. Done that. I don't want to be there nor do I want you to be. But to get out, you have to fight that impulse that got you there in the first place.

My current workflow is to keep a wiki. Would you say hydra would replace or complement that, especially if you're used to keeping notes the wiki way?

Hydra is part of the documentation process imo. Truthfully, the most important stuff that goes in your experiment journal is all those pesky parameters and things that can surprisingly change results.

So I love that hydra uses OmegaConf and I essentially get 3 copies: the experiment config yaml, the wandb log, a dictionary in the checkpoint. Multiple times my dumbass has had to try to match the checkpoint to the wandb log, so the redundancy is incredibly helpful. Sometimes just a library version has unexpected changes on performance and this makes it trivial to trace. The yaml file is more helpful when passing off the code to someone else or releasing to public.

So yeah, I would say that it'll benefit no matter how you document. Use whatever documentation method works for you. Reports can still offer some benefits in just throwing some charts together quickly and organizing but I think you'd still benefit from hydra. It's too easy to lose track of those little things and this helps me automate. But you can also just straight up use OmegaConf or even dictionaries. Whatever works for you.

The real help is logging. So whatever tools help you log, use them. This is just what I benefit from (there's a lot I can talk about too and I'd love to see what others do as well)

Hydra and omegaconf are almost officially abandonware; you shouldn't really depend on them.

Anyone use a digital notebook, like the reMarkable, for this kind of thing?

Started but stopped. Most of the things are commands, code snippets, and URLs, all of which are tedious to hand-write and just easier to copy & paste. Often I do 'typescript' captures for shell sessions, asciinema, or stuff like that, and file those.

Also, use git, commit everything, and never care about doing tidy commits. Just commit, commit, commit, and use tons of branches to try out stuff. If you need clean history later on, you can always do an interactive rebase, squash merge, or whatever. But having documentation of all the things tried and failed is far more important.

I thought the entire point was that it syncs cross platform really quickly and lets you have the best of both worlds?

Only handwriting, in a proprietary format. It isn't at all what one would wish for. It works as a replacement for a paper notebook, but it largely ignores the things one could do with more digital capabilities. In that way it is even worse than OneNote.

If you want to do something like that, my recommendation would actually be something like OneNote on some Windows tablet.

Work logs are standard practice at my company and they are AWESOME

At my company too. We put everything into our GitHub issues. Some things also go into the README.md or the GitHub wiki, but they usually get lost there.

All of this agreed.

Now, in the age of AI, many students entering CS need to do this NOW; otherwise any answer they come up with in an interview will be assumed to have come from an AI, and they'll need to show that something useful came out of their own blog posts or research.

This is what it now means to know how to experiment, understand, and build knowledge, rather than spitting out the answer because it came from Stack Overflow or ChatGPT.

The mistakes are raw, and all the learnings are in a blog post, which is what makes us human. Yet 90% of candidates do not do this, which is why most of them cannot explain an AI's mistakes in an interview if they use one.

I actually really like this idea. I've often found it odd that we don't show off reports or how we run experiments during interviews. Certainly this has far greater bearing on your aptitude than leetcode.

  > 90% of candidates do not do this or cannot explain an AI's mistakes in an interview.
I have a growing concern that people do not see mistakes. This seems to be a bigger divide than "uses AI to code" vs "doesn't".

So here's the thing I struggle with. I do a lot of work in jupyter notebooks. I come up with a new model or approach to some problem, and I want to fork out and test a hypothesis in the background (which might be some set of hyperparameters, and might take several minutes, or hours; call it Run A) while continuing to work down some other path in the same notebook, and maybe kick off a Run B that explores some other change (like a restructure of the code that's not "compatible" with the hyperparameter search of Run A).

Then at some point when Run A finishes, I want to incorporate the changes I made in Run B and kick off Run C, and so on.

The hard/important things are:

1) Being able to do this while staying in a Jupyter notebook context the whole time. Even something as simple as multiprocessing sucks because I've found it's too hard to manage in a Jupyter context (e.g. how do you handle where stdout and stderr go?). It's easier if you move to scripts where you have full support for this sort of thing and you are expecting to look at multiple log files on disk and whatnot.

Also the sequential nature of notebooks doesn't help when you want to occasionally fork out or conditionally run stuff.

2) Keeping track of all these changes and hypotheses and merging the results/code together as you learn. It's like you need a VCS for your hypotheses. Maybe hydra & wandb help with that, I haven't used them. But this idea of keeping track of hypotheses seems like the more fundamental thing.

3) The main reason I prefer to stay in a notebook context is because I have all my objects easily accessible. My models, all my dataframes, functions to do some ad-hoc charting etc, all super easy to access in a REPL-like form. That is invaluable for doing ad-hoc sanity checks or digging/drilling down. So a big part of the workflow is you basically have this in-memory database of a bunch of relevant objects and you're querying it and constructing new objects & visualisations using Python as your tool, without having to load things from disk or build up the context from scratch. It's all "just there".

4) And then sometimes you want to take the results X1 of that notebook and plot them against some entirely different set of data X2 that requires a whole bunch of other code that you've defined in some other notebook somewhere, or maybe even as a real Python module. Like maybe that data lives in a database and you transform it or something. So OK, you call some functions to load X2 within your original notebook, but BOOM you get an OOM and you're like ok now I have to write some code to serialise X1 to disk, and make YET ANOTHER notebook so I can go analyze X1 and X2. It all just seems so... unnecessary, if only the right tooling existed.

My current best approach is to use semantic versioning on the filename, just copy the whole notebook each time I make a fundamental change, and try to keep track of my hypotheses, preconditions, learnings etc within comments and have a few of those on the go running, but it's often hard to engage in critical thinking when everything you know is sprawled across multiple notebooks.

Maybe a simple global journal is the only thing for this sort of use case. And that doesn't even address (4) which is often a huge pain point. Can anyone think of something better?

For me, those side-investigations are often physical experiments, which run on their own time scale. Plus they often run on another computer, to reduce the risk of crashing and losing data, or just physical proximity to the experiment.

How I tie those threads together is by the data that they generate. I use ASDF because it works for the kind of stuff I'm doing, but choose your poison. Once the data are in the bag, the cells that analyze or report the results can stay in the same notebook, or be copied into your main notebook. My data aren't so huge that there's much of a penalty in re-loading them.

For me, reproducibility is more important than organization, because I'm not all that organized anyway. So, a single master notebook at the end of a study isn't my top goal.

  > I do a lot of work in jupyter notebooks.
I don't think I'll have good advice for you if you want to use jupyter notebooks, hopefully someone else will. I *hate* notebooks. I think they are great for reports or for demos (especially when teaching), but I honestly do not get how people use them in research or general programming.

I will use vim and ipython though. If on a remote machine I'll use tmux (preserves sessions), but locally I use Ghostty[0]. I can iterate through code this way without the notebook and am far less likely to get caught out by out-of-order execution. I can get all the ad-hoc, persistent-memory, auto-updating-function benefits with ipython (autoreload). But vim and ipython don't require me to lose my modularity, organization, automated record keeping, and the rest.
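
For anyone who hasn't used it, the autoreload setup is just this at the top of the ipython session (the module name is made up):

    %load_ext autoreload
    %autoreload 2                  # re-import edited modules automatically

    from my_project import train   # hypothetical module; edits get picked up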

If it works for you, keep it! I'm just saying what works for me (any switch will cause some disruption). But I do want to stress that there's tons of other options for keeping things in memory without using jupyter notebooks (pdb is also a fantastic tool!). But also be careful because persistent memory can easily bite you in the ass too. Easy to forget what's still in memory. I'll also add, that having also been the person that maintains our lab's compute systems, I'm wildly annoyed with notebooks and VSCode users leaving their workloads in memory. This is a user thing more than a tool thing but there's a tendency here and it eats up resources that other people need. Just make sure to disconnect when you leave the desk.

  > I want to fork out and test a hypothesis in the background 
But my process does greatly help with this! IMO you should be trying to run experiments in parallel. Operating in this style there's no forking, you're just launching another job. I like using a job launcher like slurm when I have multiple machines but just a simple bash script to launch is often more than good enough.

The point is to not fork. You should clone. With changes, I suggest using git branches. But if your code is written to be modular and flexible it is often really quick and easy to add new functions to handle different tasks, add new types of measurements, or whatever.

The two big reasons to write like I do is that

  1) It is (partially) self documenting. You don't have to think about writing down and remembering all your hyperparameters. I'm going to forget and so I need to automate that to prevent this
  2) I'm running experiments! I may be dumb, but I'm not so dumb I think I am not going to change details of my experiments as the project matures.
That's why I say it is about being lazy. I'm writing the way I do because I know that whether it is tomorrow, next week, or 6 months from now, I'm going to need to make changes that are going to make things very different from where they are today. I don't think of it so much as having foresight about the future so much as I'm just frustrated at having to constantly dig myself out of a hole and this makes that a lot easier and lets me get back to the fun exploration stuff faster. It is 100% about having version control over my hypotheses and experiments.

So I'd argue you should move away from notebooks and use other better tools more suited for the job. It'll definitely cause disruption and you're definitely going to be slower at first but find what works for you. The reason people love tools like vim or love working in the cli is because they are modifiable. There's no one tool that works for everyone. I'm not sure there's even a tool that out of the box works for any one person (maybe the original dev?)! But there's a ton of power in having tools which I can adapt to me and the way I work. I can make it help me catch my common mistakes and highlight things I care about. You don't need to spend hours doing this stuff. It develops over time. But go into any workshop and you'll see that everyone has modified the tools for them. We're programmers, we have way more flexibility over customization than people working with physical stuff. Use that to your advantage. And truthfully, you should see how that idea becomes circular here. I'm just designing my code and experiments to be like my tools: environments to be shaped.

[0] https://ghostty.org/

Or any other strongly typed natively compiled language for that matter.

Given the complexity of modern compiler optimizations, integrating a small neural network into a C compiler like GCC might help generate faster executable code by guiding optimization decisions.

My startup is training an AI from scratch to emit x86_64 assembly from plain English prompts - we're skipping the middlemen, the long-in-the-tooth incumbent fatcats.

V2 will be wasm and then everyone will be out of a job. You're welcome.

Yes and no.

The main problem with C optimization is that C is not an expressive language, and using it tends to quantize the problem into tiny chunks, so the overall picture is lost.

So yes, an LLVM-style approach that optimizes something already coded, similar to bytecode, could probably get some speedup, but with higher-level languages the speedup could be magnitudes better.

I think the future belongs to hybrid approaches: create, or better put, handcraft the well-known domains in C or even assembly, but for the less-known ones use something like a Prolog solver.

-O3 -march=native is pretty much all you need and the rest is marginal or circumstantial.

What makes you so confident that an AI-assisted compiler couldn't significantly enhance optimizations? A relevant example of a complex problem where neural networks have improved performance is found in chess engines. Today, top-level engines like Stockfish have integrated NNUE ("Efficiently Updatable Neural Network") which has significantly boosted their performance.

That doesn't fix suboptimal algorithm choices, but neither would a small NN in the compiler. A big NN could rewrite large sections of code without changing the logic, but why do that during translation instead of rewriting the source?