CRAN’s approach here sounds like it has all the disadvantages of a monorepo without any of the advantages.

In a true monorepo — the one for the FreeBSD base system, say — if you make a PR that updates some low-level code, then the expectation is that you 1. compile the tree and run all the tests (so far so good), 2. update the high-level code so the tests pass (hmm), and 3. include those updates in your PR. In a true centralized monorepo, a single atomic commit can effect a vertical-slice change through a dependency and all of its transitive dependents.

I don’t know what the equivalent would be in distributed “meta-monorepo” development à la CRAN, but it’s not what they’re currently doing.

(One hypothetical approach I could imagine, is that a dependency major-version release of a package can ship with AST-rewriting-algorithm code migrations, which automatically push “dependency-computed” PRs to the dependents’ repos, while also applying those same patches as temporary forced overlays onto releases of dependent packages until the related PRs get merged. So your dependents’ tests still have to pass before you can release your package — but you can iteratively update things on your end until those tests do pass, and then trigger a simultaneous release of your package and your dependent packages. It’s then in your dependents’ court to modify + merge your PR to undo the forced overlay, asynchronously, as they wish.)

There is a parallel with database transactions: it's great if you can do everything in a single database/transaction (atomic monorepo commit). But that only scales so far (on both dimensions: single database and single transaction). You can try distributed transactions (multiple coordinated commits) but that also has limits. The next step is eventual consistency, which would be equivalent to releasing a new version of the component while preserving the old one and with dependents eventually migrating to it at their own pace.

Doesn't that rely on the code being able to work in both states?

I mean, to use a different metaphor, an incremental rollout is all fine and dandy until the old code discovers that it cannot work with the state generated by the new code.

Yes, but depending on the code you’re working on that may be the case anyway even with a monorepo.

For example, a web API that talks to a database but is deployed with more than one instance, where instances get rolling updates to the new version to avoid any downtime: there will be overlapping requests hitting both old and new code at the same time.

Or if you want to do a trial deployment of the new version to 10% of traffic for some period of time.

Or if it’s a mobile or desktop installed app that talks to a server where you have to handle people using the previous version well after you’ve rolled out an update.

Yes, it does.

> One hypothetical approach I could imagine, is that a dependency major-version release of a package can ship with AST-rewriting-algorithm code migrations

Jane Street has something similar called a "tree smash" [1]. When someone makes a breaking change to their internal dialect of OCaml, they also push a commit updating the entire company monorepo.

It's not explicitly stated whether such migrations happen via AST rewrites, but one can imagine leveraging the existing compiler infrastructure to do that.

[1]: https://signalsandthreads.com/future-of-programming/#3535

This is more or less how Facebook developed PHP -> Hack on the fly. Each new language feature would be patched in, and at the same time, a whole-monorepo transform would be run to adopt the feature. Pretty neat, if a logistical nightmare.

> In a true monorepo ...

Ideally, yes. However, such a monorepo can become increasingly complex as the software being maintained grows larger and larger (and/or more and more people work on it).

You end up with massive changes - which might eventually become something that a single person cannot realistically contain within their brain. Not to mention clashes - you will have people making contradictory/conflicting changes, and there will have to be some sort of resolution mechanism outside (or the "default" one, which is first come first served).

Of course, you could "manage" this complexity by drawing API boundaries/layers and deeming those APIs important enough not to change too often. But that simply means you're a monorepo in name only - not too different from having separate repos with versioned artefacts and a defined API boundary.

>Of course, you could "manage" this complexity by drawing API boundaries/layers and deeming those APIs important enough not to change too often. But that simply means you're a monorepo in name only - not too different from having separate repos with versioned artefacts and a defined API boundary.

You have visibility into who is using what and you still get to do an atomic update commit even if a commit will touch multiple boundaries - I would say that's a big difference. I hated working with shared repos in big companies.

They don’t have to be massive changes. You can release the feature with backwards compatibility and then gradually update dependencies and remove the old interface.

I think the way to go is to do such big backwards-incompatible refactors gradually. E.g. you want to make all the callers specify some additional parameter. So first you create a version of your API which populates this parameter with some reasonable default. Then the old API is marked deprecated and just calls the new API with that default value, and then you inline the old API at every call site. After a while it’s possible to remove the old API.
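
A minimal sketch of that staged migration in R, using hypothetical functions fit_model()/fit_model2() rather than any real package's API:

    # Step 1: the new API takes the extra parameter, with a reasonable default.
    fit_model2 <- function(data, weights = rep(1, nrow(data))) {
      stats::lm(y ~ ., data = data, weights = weights)
    }

    # Step 2: the old API becomes a deprecated thin wrapper around the new one.
    fit_model <- function(data) {
      .Deprecated("fit_model2")  # warns callers on every use
      fit_model2(data)           # delegates with the default weights
    }

    # Step 3 (later): once every call site uses fit_model2(), delete fit_model().

In a CRAN package you'd typically keep the deprecated wrapper around for at least one release cycle before deleting it.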

That said, you of course need some tooling to reliably discover all the callers and do those migrations on a large scale.

Easier to do if all the code is owned by one org but harder if you can’t reliably tell who’s using your APIs.

However, having centralized migrations really saves the org a lot of work.

Yes, it's nice when you can update arbitrarily distant files in a single commit. But when an API is popular enough to be used by dozens of independent projects, this is no longer practical. Even in a monorepo, you'll still need to break it up, adding the new API, gradually migrating the usages, and then deleting the old API.

Yes.

Also, the other problem with a big monorepo is that nothing ever dies. Let's say you have a library and there are 1000 client programs or other libraries using your API. Some of them are pretty popular and some of them are fringe.

However, when you are changing the API they all have the same weight: you have to fix them all. In the non-monorepo case the fringe clients will eventually die, or their maintainers will invest in them and update them. It's like capitalism vs communism with central planning and all.

If the monorepo is built and tested by a single build system (Bazel, Buck, etc.), then it can use the dependency graph to find leaf targets with no users - for example, a library plus its tests that nothing else depends on (granted, that might be something new popping up, still in early development).

Bazel has the concept of visibility: while you are developing something in the tree, you can explicitly say who is allowed to use it (like a trial version).

But the point is: if something is built, it must be tested, and coverage should catch what is built but not tested - and also what is built and tested but not really used much.

But why remove it if it takes next to no time to build and test? And if it takes more time to test, it's usually on your team to run your own testing environment rather than rely on the general presubmit/preflight one - and since you only have so much budget from the last capacity planning, you'll soon ask yourselves: do we really need this piece of code and its tests?

I mean, it's not perfect - there will always be something churning away, using time and money - but until it's a pretty big problem it won't go away automatically (yet).

Dead code in a huge monorepo is more costly than just build and test time. It's also noise when searching through code. One thing to realize is that deleting dead code from the tree doesn't destroy anything, because it's still in the repo history and can be restored from there.

Hence why Google has Sensenmann to reap dead code: https://testing.googleblog.com/2023/04/sensenmann-code-delet...

It's common to think of monorepos as a way of shipping code. IMO, they're actually a mental model of your code and its dependencies. It's not just tooling for tooling's sake; it gives you a new view into your code and its dependencies.

Holy fuck, this is great! Thank you!

Libraries with no usages can easily be deleted. For entire apps, it’s harder to tell whether they’re still used, because that information isn’t in the repo.

I agree, more automated tools for API migration would be a good next step, but I think that's missing the point a bit.

Read the actionable part of the "dependency error" mail again:

> Please reply-all and explain: Is this expected or do you need to fix anything in your package? If expected, have all maintainers of affected packages been informed well in advance? Are there false positives in our results?

This is not a hard fail and demand that you go back and rewrite your package. It's also not a demand for you to go out on your own and write pull requests for all the dependent packages.

The only strict requirement is to notify the dependents and explain the reason for the change. Depending on the nature of the change, it's then something the dependents can easily fix themselves - or, if they can't, you will likely get feedback on what you'd have to change in your package to make the migration feasible.

In the end, it's a request for developers to get up and talk to their users and figure out a solution together, instead of just relying on automation and deciding everything unilaterally. It's sad that this is indeed a novel concept.

(And hey, as a side effect: If breaking changes suddenly have a cost for the author, this might give momentum to actually develop those automated migration systems. In a traditional package repository, no one might even have seen the need for them in the first place)
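
For what it's worth, a maintainer can already run roughly the same reverse-dependency check locally before submitting. A sketch using base R's tools package (the directory path and repository URL here are placeholders):

    # "pkg_dir" contains the built tarball(s) of the package(s) to check.
    # CRAN packages that Depend on / Import / LinkTo them are downloaded,
    # checked against the new version, and the results summarized.
    tools::check_packages_in_dir(
      "pkg_dir",
      reverse = list(
        repos = "https://cloud.r-project.org",
        which = c("Depends", "Imports", "LinkingTo")
      ),
      Ncpus = 4
    )
    tools::summarize_check_packages_in_dir_results("pkg_dir")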

The author is a little confused. A system that blocks releases on defects and doesn't pin versions is continuous integration, not a monorepo. The two are not synonymous. Monorepos often use continuous integration to ensure their integrity, but you can use continuous integration without a monorepo, and monorepos can be used without continuous integration.

> But the migration had a steep cost: over 6 years later, there are thousands of projects still stuck on an older version.

This is a feature, not a bug. The pinning of versions allows systems to independently maintain their own dependency trees. This is how your Linux distribution actually remains stable (or used to, before the onslaught of "rolling release" distributions, and the infection of the "automatically updating application" into product development culture, which constantly leaves me with non-functional mobile applications whereupon I am forced to update them once a week). You set the versions, and nothing changes, so you can keep using the same software, and it doesn't break. Until you choose to upgrade it and deal with all the breaking shit.

Every decision in life is a tradeoff. Do you go with no version numbers at all, always updating, always fixing things? Or do you always require version numbers, keeping things stable, but having difficulty updating because of a lack of compatible versions? Or do you find some middle ground? There are pros and cons to all these decisions. There is no one best way, only different ways.

For me the comparison to a monorepo made a lot of sense. One of the main features of a monorepo is maintaining a DAG of dependencies and using that to decide which tests to run given a code change. CRAN package publishing seems to follow the same idea.

> One of the main features of monorepo is maintaining a DAG of dependencies

No, that's the opposite of a monorepo (w/continuous integration). A monorepo w/continuous integration does not maintain any list of dependencies or relationships, by design. Every single commit is one global "version" which represents everything inside the repo. Everything in the repo at that commit is only guaranteed to work with everything else in the repo at that commit. You use continuous integration (w/quality gates) to ensure this, by not allowing merges which could possibly break anything.

Maintaining a DAG of dependencies is a version pinning strategy, the opposite of the continuous integration version-less method. It is intended for external dependencies that do not exist in the current repository - which is why it's used for multi-repos, not monorepos.

But as I originally pointed out, you can have a monorepo where everything is version-pinned (not using continuous integration). It's just not the usual example.

A lot of monorepo strategies that I've seen involve maintaining a DAG of dependencies so that you don't need to run CI over the entire system (which is wasteful if most of the code hasn't changed), but only a specific subset.

Each component within the monorepo will declare which other components it depends on. When a change occurs, the CI system figures out which components have changed, and then runs tests/builds/etc. for those components and all their dependents. That way, you don't need to build the world every time, you just rebuild the specific parts that might have changed.

I think that specific concept (maintaining a single "world" repository but only rebuilding the parts that have changed in each iteration) is what the author is talking about here. It doesn't have to be done via a monorepo, but it's a very common feature in larger monorepos and I found the analogy helpful here.
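
As a toy illustration of that affected-targets idea (made-up component names, not any particular build system's API), sketched in R since that's the ecosystem under discussion:

    # Dependency edge list: each row says `pkg` depends directly on `dep`.
    edges <- data.frame(
      pkg = c("app", "app", "lib_a", "lib_b"),
      dep = c("lib_a", "lib_b", "core", "core")
    )

    # Walk the graph in reverse: a changed component plus everything that
    # transitively depends on it must be rebuilt and retested.
    affected <- function(changed, edges) {
      hit <- changed
      repeat {
        wider <- union(hit, edges$pkg[edges$dep %in% hit])
        if (length(wider) == length(hit)) return(hit)
        hit <- wider
      }
    }

    affected("core", edges)   # "core" "lib_a" "lib_b" "app"
    affected("lib_a", edges)  # "lib_a" "app" - core and lib_b are untouched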

That's a cool thing to have, and I'm glad you found the analogy helpful, but I hope you understand the CI DAG you're talking about doesn't make anything more stable. It is just there to cache build jobs. To make things more stable (what the post is referring to) you need a separate mechanism; in a monorepo w/CI, that's gating the merge on test results (which doesn't require a DAG). (And actually, if you skip tests in a monorepo, you are likely to eventually miss systemic bugs.)

That's something you can do just as well with multiple repos, though.

What a monorepo gives you on top of that is that you can change the dependents in the same PR.

For me too - in a way, a "virtual" monorepo, as if all these packages belonged in some ideal monorepo, even though they don't.

The problem with pinning dependencies is clashing transitive dependencies across a bunch of dependencies. For me this happens in Python every third time I try to run something new, even though version numbers are pinned (things can still fail on your system, or you may want to include dependencies with incompatible transitive dependencies). It has never happened to me with R, and now I know why.

The actual trade-off is end-user experience and ease vs. package-developer experience and ease. It is not about updating R or a package; it is about somebody trying to create or run a project without getting into a clash of dependencies, for reasons that can hardly be controlled by either them or the package developer.

Stability vs. security - that is the trade-off pinning gives you, and it's why rolling releases are more popular these days. No?

Rolling releases are popular because people got sick of waiting two years to upgrade their distro and get the new version of some Linux app, because one version of a distro keeps the same old versions of its apps forever (in the stable tree). The unstable and testing branches have newer releases, but as the names imply, they break quite a bit.

So rolling releases are like an unstable/testing branch, with more effort put into keeping it from breaking. So you get new software all the time. The downside is, you also don't get to opt-out of an upgrade, which can be pretty painful when the upgrade breaks something you're used to.

> There is no one best way

I think that the laws of physics dictate that there is. If your developers are spanning the galaxy, the speed of development is slower with continuous integration than with pinning deps.

We don't know the laws of physics, though. We just have models which both fit into a human brain and are surprisingly good for what humans have had the opportunity to experiment against. That is really awesome, but it doesn't mean we know.

It’s partial knowledge.

Saying “We don’t know.” feels more wrong to me than “We know.” (emphasis on the periods).

They’re not confused. It’s an analogy.

I don't see that. What I do see is an interesting thought experiment in the title and then zero delivery in the body text.

The term clickbait comes to mind.

To be honest, I don't know which is worse: installing an R library that requires re-installing a bunch of updates and being stuck in R installation hell, or experiencing a conda install that is stuck in "Resolving Dependencies" hell. The only thing I've learned to mitigate both is to just containerize everything.

I genuinely enjoy R. I use it for calculations daily. In comparison using Python feels tedious and clunky even though I know it better.

> CRAN had also rerun the tests for all packages that depend on mine, even if they don’t belong to me!

Another way to frame this is that these are the customers of your package's API. If you broke them, you are required to ship a fix.

I see why this isn't the default (e.g. on GitHub you have no idea how many people depend on you). But the developer experience is much nicer like this. Google, for example, makes this promise with some of their public tools.
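
As an aside, that visibility ("who depends on my package?") is queryable from any R session; a quick sketch using base tools, with "grf" picked only because the article mentions it:

    # List the CRAN packages that declare a dependency on "grf".
    db <- available.packages(repos = "https://cloud.r-project.org")
    tools::package_dependencies(
      "grf", db = db, reverse = TRUE,
      which = c("Depends", "Imports", "LinkingTo", "Suggests")
    )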

Outside the world of professional software developers, R is used by many academics in statistics, economics, social sciences etc. This rule makes it less likely that their research breaks because of some obscure dependency they don't understand.

I've never written a line of R but it seems slightly underrated from what I've seen.

Maybe there are some massive footguns I'm not aware of, but Python is mostly oriented around variables rather than pipelines, so it never seems to flow as well as R.

What people have to understand is that this is not a repository for software developers. It is a repository for people needing tools for statistical analysis and scientific computing.

The way a software developer thinks about a package is totally different to the way someone trying to perform statistical analysis thinks about packages.

This is the same for CTAN; the name is no coincidence. The packages are for users, not developers.

As an admitted R hater, this is exceptionally good framing. Reproducible science has only ever had limited buy-in, but R was extra late to the party with packrat and renv. Enforcing code consistency at this level has probably done wonders for reducing some amount of churn.

This was an interesting article, but it made me even more interested in the author's larger take on R as a language:

> In the years since, my discomfort has given way to fascination. I’ve come to respect R’s bold choices, its clarity of focus, and the R community’s continued confidence to ‘do their own thing’.

I would love to see a follow-up article about the key insights that the author took away from diving more deeply into R.

> When declaring dependencies, most packages don’t specify any version requirements, and if they do, it’s usually just a lower bound like ‘grf >= 1.0’.

I like the perspective presented in this article, I think CRAN is taking an interesting approach. But this is nuts and bolts. Explicitly saying you're compatible with any future breaking changes!? You can't possibly know that!

I get that a lot of R programmers might be data scientists first and programmers second, so many of them probably don't know semver, but I feel like the language should guide them to a safe choice here. If CRAN is going to email you about reverse dependencies, maybe publishing a package with a crazy semver expression should also trigger an email.

> Explicitly saying you're compatible with any future breaking changes!? You can't possibly know that!

I kind of like it, in a way. In a lot of ecosystems it's easy for package publishers to be a bit lazy with compatibility, which can push a huge amount of work onto package consumers. R seems similar to Go in this regard, where there is a big focus on not breaking compatibility, which then means they are conservative about adding new stuff until they're happy to support it for a long time.

I guess it wouldn't bother me if it weren't a semver expression. As a semver expression it's ridiculous on its face: a breaking release will break your code until proven otherwise. "foo >= 2024R1", well, I'm not entirely comfortable with it, but if you've got a comprehensive plan to address the potential dangers (as CRAN appears to), godspeed.

> In what other ecosystem would a top package introduce itself using an eight-variable equation?

That's the objective function of Hastie et al's GLM. I had a good chuckle when I realized the author's last name is Tibshirani. If you know you know.

And if I don't know, can I know?

Hastie and Tibshirani wrote a famous book on ML (https://hastie.su.domains/ElemStatLearn/), and extended GLMs into GAMs: https://en.wikipedia.org/wiki/Generalized_additive_model

Peeved that it isn't an equation. There's no equals signs!

I'm the author. Thanks for the copy-edit, I should have indeed said "expression".

Objective function or expression might have been more precise, indeed.

Robert Tibshirani has a daughter named Julie.

This is awesome. I run a team that uses software I produce, and I have a rule that I can't deliver breaking changes, and I can't force migrations. I can do the migration myself, or I have to emulate the old behavior next to the new. It makes you think really hard about releasing new APIs. I wish this was standard practice.

Sounds like an easy workaround would be versioned APIs, then. Failing that, it sounds like the API will forever be stuck, or add-only, creating a mess. That is, if it is not already very stable.

This (with some tweaks) is what I envision the future of NPM, Cargo, and NuGet should look like.

Automated tests, compilation by the package publisher, and enforcement of portability flags and SemVer semantics.

One workaround that isn't mentioned is that one could just release a new package entirely for each blocked release. grf1, grf2, grf3...

The downside is that dependents have to manually change their dependency, and you get a proliferation of packages with informal relationships.

I think there's tooling for doing something like this reverse-dependency trick in nixpkgs. I made a change to pre-commit, and somebody more in-the-know than I stopped by the PR and pointed out the two Python packages that my changes broke.

Zero wouldn't have been surprising to me, nor would several hundred, but two... what a conveniently actionable number.

It has me wanting to give names to some of my hacks and publish them as packages so that people are more aware when their changes are breaking changes. On the other hand, if I do something weird, I don't necessarily want to burden others with maintaining it. Tradeoffs...

Debian is kind of like that, except packages broken by upgrades are mostly just removed.

I've been using it for so many years, and it makes complete sense now that you mention it! Thanks!

Eventually, yes I guess. But long before that the breaker and breakee both are notified, and the breakage hopefully is fixed. As it should be.

I would hope the other aspirational software distribution systems (pip, npm, et al.) ALSO do that, but according to this article, I guess they don't? Not shocked, to be honest.

But how would that work?

Say I have software that runs just fine but has not been updated to the latest runtime of Python or Node (as per your example). Perhaps a dependency I use has a broken recent version, but the old version I use works fine. You remove the package, and now it breaks my software. This would effectively mean that all libraries/dependencies that are "abandoned" by the author or inactive would be deleted, which then results in all the software that used them also breaking.

Unless I misunderstood something?

> But… CRAN had also rerun the tests for all packages that depend on mine, even if they don’t belong to me!

When you propose a change to something that other things depend on, it makes sense to test those dependents for regressions; this is not earth-shattering.

If you want to change something which breaks them, you have to then do it in a different way. First provide a new way of doing something. Then get all the dependents that use the old way to migrate to the new way. Then, when the dependents are no longer relying on the old way, you can push out a change which removes it.

>If you want to change something which breaks them, you have to then do it in a different way.

Almost every other package repo works differently, and publishing packages which would break other packages is more than common - it is the standard way to publish updates.

>Then get all the dependents that use the old way to migrate to the new way. Then, when the dependents are no longer relying on the old way, you can push out a change which removes it.

This is how no software repo actually works, because it is insane.

Nope!

- Breaking things is the obviously insane option, rather than not breaking things.

- Staged obsolescence is sane compared to breaking things, and compared to installations carrying multiple versions of the same package.

There being a single version of every package in any given installation, with every package carefully managed for backwards compatibility and features removed only when everything has migrated off them, is utterly sane.

There may be exceptions. (The idea that there are never exceptions is insane). Suppose that it is discovered that there is no secure way of using some API, and there is a real threat: it has to be removed.

I think that a good way forward would be to identify the packages which use the API, and contact all the maintainers for an emergency conference.

The package system should also be capable of applying its own patches to packages; if something must break for backwards compatibility, the package system should provide patches to fix the broken packages until the upstreams develops their own fixes.

Have you ever worked with a package manager for a programming language?

In almost every case it is the sole duty of the people using the packages to ensure they adhere to whatever standards they desire. This is how packaging software works in basically every case.

>The package system should also be capable of applying its own patches to packages; if something must break for backwards compatibility, the package system should provide patches to fix the broken packages until the upstreams develops their own fixes.

This is just you not understanding software. This is obviously not possible; it is also not desirable.

> Taking advantage of the major version bump, we had snuck in a small API change. This change then caused a test failure in policytree

Wait a second. Another package failed because of your MAJOR version upgrade, because you changed your API? It's crazy for any package manager to enforce checks that don't follow semver.

I just put my own library dependencies into submodules. They act like local copies, so you can develop them while developing the main repo.

That essentially makes the high level project a monorepo while giving you the option to work on the submodule on its own.

I recently started using python packages for some statistical work.

I then discovered that there are often bugs in many of the Python stats packages. Many Python numerical packages also have a reputation for changing how things work "under the hood" from version to version. This means that you can't trust your output after a version change.

Given all of the above, I can see why "serious" data scientists stick with R and this article is just another reason why.

Might be useful to add "R" somewhere to the title to make it clearer what this article is about.

I feel like if more package repositories did this, you would end up just finding more and more workarounds and alternative distribution methods.

I mean, just look at how many projects use “curl and bash” as their distribution method even though the project repositories they could use instead don’t even require anything nearly as onerous as the reverse dependency checks described in this article. If the minimal requirements the current repos have are enough to push projects to alternate distribution, I can’t imagine what would happen if it was added.

Damn. Well, time to fork everything and keep internal patches internal.

This system is unworkable.

That’s the default in the monorepos I’ve worked on.

When a third party dep is broken or needs a workaround, just include a patch in the build (or fork). Then those patches can be upstreamed asynchronously without slowing down development.

My wife is having to learn R as part of her Masters - she's not got a technical background and my impression is it's throwing people into programming at the deep end, and likely after the course she will never actually use it.

Meanwhile I read the material and it absolutely feels like a cult - "R is fun" is like something they say to persuade themselves they are not in a cult.

[flagged]

Please don't do this here.

I know, I know - I would never make it in a corporate environment; I would get an appointment with HR on the first day to explain my "funny" jokes XD