I sent the following email to the Flink dev mailing list.

TL;DR

  • We have over 1.2k open PRs. This is an issue because it makes new contributors think twice about contributing and looks to committers like a problem that is too big to solve.
  • There have been various attempts, over the last 6 years, to enable the Stale PR bot/action to prompt authors to refresh old PRs and auto-close them if no action is taken.
  • These were rejected as some committers felt this was punishing contributors for the committers not reviewing/closing PRs fast enough.
  • Others felt that, rather than “sweeping the problem under the rug”, using the Stale PR functionality would actually reveal the true scale of the issue by allowing committers to see which PRs were truly active.
  • Other Apache projects such as Kafka, Beam, Spark, Airflow and many others have enabled the stale PR GitHub action.
  • Despite this, Kafka still has 1k open PRs. However, these PRs have all been updated/commented on in the last 3-4 months, so can be considered active.
  • For Flink, only 12% of the open PRs have been updated in the last 3 months and only 41% in the last year.
  • I propose we enable the Stale PR GitHub action to clear the backlog and reduce the PRs down to those that are active and relevant.
  • We can start with PRs that haven’t been active in the last year and give authors 3 months to refresh them. These thresholds could then be reduced over time towards the norm for other Apache projects: 3 months of inactivity and 1 month to refresh.

The Problem

Currently, we have 1245 open PRs in the main upstream Flink GitHub repository, the oldest of which was created over seven and a half years ago. Many of these PRs haven’t been commented on or interacted with in years.

I am definitely not here to cast blame. Flink is a huge project, and the committers are volunteers who only have so much time. Flink is also certainly not the only open source project to face this issue. However, the large number of open PRs is a drag on the community: it makes new contributors think twice about opening PRs, and I am sure it is demoralising for committers to see the mountain keep growing.

Dealing with this was a big part of why the Community Health Initiative (CHI) working group was set up. We are making progress on reviewing and triaging the top of the PR stack. However, the bottom of the stack is also an issue.

Background

It seems reasonable that a PR that is the better part of a decade old and hasn’t been commented on in years is probably not relevant and could be closed. Indeed, this very point has been brought up before, first in 2018, when it was commented that:

The current situation with 350 open PRs may send a signal to contributors that it may actually be too much hassle to get a change committed in Flink.

At that time, there was some push-back against the proposal to use the stale PR bot, mostly because auto-closing PRs was perceived as harsh given that the issue was largely due to a lack of committer review. The Beam community went ahead and enabled it, but the discussion on the Flink side seems to have then died out.

The stale PR bot was raised again in 2019 and had a lot of support, including several examples of other Apache projects using it to good effect. However, this was again pushed back against as hiding the symptoms of the underlying problem, namely committers not engaging actively enough to close PRs that were no longer relevant or had no hope of being merged. The counterargument was that the PR closing bot was only one part of a solution, not the whole solution, and that, far from hiding the problem, the stale labelling would highlight the scale of the issue.

The Stale PR closing issue was raised again in 2022 and 2023 with similar arguments, including from CHI’s own David Radley:

We have over 1000 open prs. This is a lot of technical debt. I came across a 6 month old pr recently that had not been merged. A second Jira issue was raised for the same problem and a second pr fixed the issue (identically). The first pr was still on the backlog until we noticed it.

What other Apache projects are doing

The Stale PR/Issue GitHub action is used by many Apache projects including Beam, Kafka, Spark and Airflow to name a few.

Apache Kafka uses a 90-day (3 months) limit to define a stale PR and then allows a further 30 days (1 month) for the author to refresh the PR before it is auto-closed. Even with the Stale PR action enabled, Kafka still has over 1000 open PRs. However, all of these PRs were updated, commented on or otherwise interacted with in the last 3-4 months, which gives a much more accurate picture of how many open PRs are actually active.

For comparison, I did some basic analysis of Flink’s open PRs. 55% were updated in the last 2 years, 41% in the last year, 12% in the last 3 months and only 8% in the last month. It is reasonable to conclude that over half these PRs are probably not relevant anymore or need significant updates to be compatible.
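The kind of breakdown described above can be sketched in a few lines of Python. This is only an illustration, not the script used for the figures quoted: the function name `share_updated_within` and the sample ages are hypothetical, and in practice the last-updated timestamps would come from the GitHub API or an export of the open PR list.

```python
from datetime import datetime, timedelta, timezone


def share_updated_within(updated_at, now, days):
    """Return the percentage of PRs last updated within `days` days of `now`."""
    if not updated_at:
        return 0.0
    cutoff = now - timedelta(days=days)
    recent = sum(1 for ts in updated_at if ts >= cutoff)
    return 100.0 * recent / len(updated_at)


# Hypothetical sample: days since each open PR was last updated.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
sample_ages_days = [5, 20, 80, 200, 400, 900, 1500, 2500]
sample = [now - timedelta(days=d) for d in sample_ages_days]

# Report the share of PRs touched within each window.
windows = [("1 month", 30), ("3 months", 90), ("1 year", 365), ("2 years", 730)]
for label, days in windows:
    print(f"{label}: {share_updated_within(sample, now, days):.1f}%")
```

Running the same function over the real list of open PRs would give the percentages for each window directly.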

What should we do?

My personal take on this is that, while I agree that the issue is mostly one of committer capacity to review these PRs, the upstream PR count is currently too high and is discouraging engagement. But it is also not fair to blame committers for not wanting to spend time on PRs that are years out of date and clearly not relevant anymore.

So I think we should declare PR bankruptcy and attempt to clear away the bulk of the old PRs. I don’t use the word “bankruptcy” flippantly or to provoke, just to acknowledge that the scale of the issue has gotten too large to be dealt with through the hard work of committers alone. Once we get the PR backlog to a manageable size, we can then focus on using initiatives like CHI and other workflow improvements to keep the PR count low.

Proposal

Enable the stale PR GitHub action. This action would:

  • Identify any PR that has not been interacted with in the last X months as Stale:
    • Apply a Stale label to the PR. This will also allow committers to easily get a list of stale PRs to review and refresh/close.
    • Comment on the PR that it is considered Stale, explaining what to do to refresh it and how to engage further with the community.
  • Identify any Stale PR that hasn’t been refreshed (commented on or otherwise updated) after a further Y months as closeable.
    • Close the PR.
    • Leave a closing comment highlighting that it can be reopened at any point with pointers to how to engage the community.
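The two-stage rule above can be sketched as a small decision function. This is a simplified model for discussion, not the actual logic of the stale GitHub action: the function name `classify_pr`, its parameters and the returned labels are all hypothetical, and it assumes that `last_activity` excludes the bot's own label and comment.

```python
from datetime import datetime, timedelta, timezone


def classify_pr(last_activity, stale_since, now, stale_after_days, close_after_days):
    """Decide what a stale-PR sweep would do with a single PR.

    last_activity: time of the last human update or comment on the PR.
    stale_since:   time the Stale label was applied, or None if not labelled.
    """
    if stale_since is None:
        # Not yet labelled: mark as stale once it has been idle for X days.
        if now - last_activity >= timedelta(days=stale_after_days):
            return "mark_stale"
        return "active"
    if last_activity > stale_since:
        # The author or a reviewer refreshed the PR after it was labelled.
        return "unmark_stale"
    if now - stale_since >= timedelta(days=close_after_days):
        # Labelled for Y days with no refresh: close with a comment.
        return "close"
    return "stale"


# Example with X = 365 days and Y = 90 days.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
idle = now - timedelta(days=400)
print(classify_pr(idle, None, now, 365, 90))
```

Any interaction after the label is applied moves the PR back to the active set, so only PRs that nobody responds to are closed.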

The values of the stale (X) and close (Y) thresholds are up for discussion. At least initially, given the sheer number of old PRs, we may want to be more lenient. For example, X = 1 year and Y = 3 months would limit the initial number of stale PRs and allow committers more time to review the stale PR list. Once the PR list has been reduced sufficiently, we may want to reduce these values in increments until, for example, X = 3 months and Y = 1 month, which seem to be the values other Apache projects have settled on.
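As a rough illustration of what reducing the thresholds in increments could look like, the sketch below steps from the lenient starting values to the target values. The step sizes (3 months for X, 1 month for Y) and the function name `threshold_schedule` are assumptions made for the example; the actual increments and timing would be for the community to decide.

```python
def threshold_schedule(start, target, step):
    """Return (stale_months, close_months) pairs stepping from start to target."""
    if min(step) <= 0:
        raise ValueError("step values must be positive")
    if start[0] < target[0] or start[1] < target[1]:
        raise ValueError("start values must not be below the target values")
    stale, close = start
    schedule = [(stale, close)]
    while (stale, close) != target:
        # Lower each threshold by its step, never going below the target.
        stale = max(target[0], stale - step[0])
        close = max(target[1], close - step[1])
        schedule.append((stale, close))
    return schedule


# Start at X = 12 months, Y = 3 months and end at X = 3 months, Y = 1 month.
schedule = threshold_schedule(start=(12, 3), target=(3, 1), step=(3, 1))
for stale_months, close_months in schedule:
    print(f"X = {stale_months} months, Y = {close_months} months")
```

Each pair would stay in place until the stale PR list produced at that level has been worked through.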

Obviously, I am a relative newcomer to the community. I would really like to hear what others, especially committers, think of the above proposal and hear any other ideas people have for taming the PR count.

Alternatives

Looking through the history of discussion on the subject, on several occasions people have suggested doing more fine grained checks before closing PRs, such as:

closing up PRs after X days which:
a) Don’t have a CI that has passed
b) Don’t follow the code contribution guide (like commit naming conventions)
c) Have changes requested but aren’t being followed-up by the contributor

This is of course an option, but would probably require updating FlinkBot. There is no reason we couldn’t enable both the Stale PR GitHub Action and update FlinkBot to enforce rules like those.