Cool project. I've been meaning to do this myself for a codebase at work, and it's nice to see that this exists now.

Does the project simply compute embeddings for every function unit and cluster them, or does it also mean-pool the significant dependencies of a function? In other words, given the function

    def a():
      b()
      c()
      d()

Does it also embed b, c, and d and combine them somehow into the embedding of a?
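
To make the question concrete, here is a rough sketch of the kind of pooling I have in mind. Everything in it is hypothetical and not taken from the project: `toy_embed` is a stand-in for a real embedding model, `calls` is a hand-written call graph, and `callee_weight` is an arbitrary blending factor I made up for illustration.

```python
from typing import Callable, Dict, List

Vector = List[float]


def toy_embed(source: str) -> Vector:
    """Hypothetical stand-in for a real embedding model (simple text counts)."""
    return [
        float(len(source)),
        float(source.count("(")),
        float(source.count("\n")),
    ]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def pooled_embedding(
    name: str,
    sources: Dict[str, str],
    calls: Dict[str, List[str]],
    embed: Callable[[str], Vector] = toy_embed,
    callee_weight: float = 0.5,  # assumed blending factor, not from the project
) -> Vector:
    """Blend a function's own embedding with the mean of its direct callees."""
    own = embed(sources[name])
    # Only direct callees with known source are considered; no recursion.
    callees = [embed(sources[c]) for c in calls.get(name, []) if c in sources]
    if not callees:
        return own
    pooled = mean_pool(callees)
    return [
        (1 - callee_weight) * o + callee_weight * p
        for o, p in zip(own, pooled)
    ]


sources = {
    "a": "def a():\n  b()\n  c()",
    "b": "def b(): pass",
    "c": "def c(): pass",
}
calls = {"a": ["b", "c"]}

vector_a = pooled_embedding("a", sources, calls)
```

With this, a function that calls nothing keeps its plain embedding, while `a` ends up halfway between its own vector and the average of `b` and `c`.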

Based on your example, only the single function a() is embedded. The rest is just code, and dependencies are not resolved. Have you thought about adding this feature to your tool?

It looks like it works only on function bodies[1]. I'm not sure I understand why you would want to look at the code of invoked callables, though. Calling the same set of helper functions is already flagged, and repeated code in helpers is flagged as well when those helpers are analyzed. Do you have a specific example where you'd like a function flagged as a duplicate based on the code it calls out to?

[1] https://github.com/rafal-qa/slopo/blob/main/src/slopo/indexi...