Blame SOURCES/0023-Base.reset-plug-temporary-leak-of-libsolv-s-page-fil.patch

bc72d2
From 88a6289a4f72b11253c01a5a5d834b74d5abb6c3 Mon Sep 17 00:00:00 2001
bc72d2
From: Laszlo Ersek <lersek@redhat.com>
bc72d2
Date: Sun, 24 Apr 2022 09:08:28 +0200
bc72d2
Subject: [PATCH] Base.reset: plug (temporary) leak of libsolv's page file
bc72d2
 descriptors
bc72d2
bc72d2
Consider the following call paths (mixed Python and C), extending from
bc72d2
livecd-creator down to libsolv:
bc72d2
bc72d2
  main                                                [livecd-tools/tools/livecd-creator]
bc72d2
    install()                                         [livecd-tools/imgcreate/creator.py]
bc72d2
      fill_sack()                                     [dnf/dnf/base.py]
bc72d2
        _add_repo_to_sack()                           [dnf/dnf/base.py]
bc72d2
          load_repo()                                 [libdnf/python/hawkey/sack-py.cpp]
bc72d2
            dnf_sack_load_repo()                      [libdnf/libdnf/dnf-sack.cpp]
bc72d2
              write_main()                            [libdnf/libdnf/dnf-sack.cpp]
bc72d2
                repo_add_solv()                       [libsolv/src/repo_solv.c]
bc72d2
                  repopagestore_read_or_setup_pages() [libsolv/src/repopage.c]
bc72d2
                    dup()
bc72d2
              write_ext()                             [libdnf/libdnf/dnf-sack.cpp]
bc72d2
                repo_add_solv()                       [libsolv/src/repo_solv.c]
bc72d2
                  repopagestore_read_or_setup_pages() [libsolv/src/repopage.c]
bc72d2
                    dup()
bc72d2
bc72d2
The dup() calls create the following file descriptors (output from
bc72d2
"lsof"):
bc72d2
bc72d2
> COMMAND  PID USER FD  TYPE DEVICE SIZE/OFF   NODE NAME
bc72d2
> python3 6500 root  7r  REG    8,1 25320727 395438 /var/tmp/imgcreate-mytcghah/install_root/var/cache/dnf/fedora.solv (deleted)
bc72d2
> python3 6500 root  8r  REG    8,1 52531426 395450 /var/tmp/imgcreate-mytcghah/install_root/var/cache/dnf/fedora-filenames.solvx
bc72d2
bc72d2
These file descriptors are *owned* by the DnfSack object (which is derived
bc72d2
from GObject), as follows:
bc72d2
bc72d2
  sack->priv->pool->repos[1]->repodata[1]->store.pagefd = 7
bc72d2
  sack->priv->pool->repos[1]->repodata[2]->store.pagefd = 8
bc72d2
  ^     ^     ^     ^         ^            ^      ^
bc72d2
  |     |     |     |         |            |      |
bc72d2
  |     |     |     |         |            |      int
bc72d2
  |     |     |     |         |            Repopagestore [libsolv/src/repopage.h]
bc72d2
  |     |     |     |         Repodata                   [libsolv/src/repodata.h]
bc72d2
  |     |     |     struct s_Repo                        [libsolv/src/repo.h]
bc72d2
  |     |     struct s_Pool (aka Pool)                   [libsolv/src/pool.h]
bc72d2
  |     DnfSackPrivate                                   [libdnf/libdnf/dnf-sack.cpp]
bc72d2
  DnfSack                                                [libdnf/libdnf/dnf-sack.h]
bc72d2
bc72d2
The file descriptors are *supposed* to be closed on the following call
bc72d2
path:
bc72d2
bc72d2
  main                                         [livecd-tools/tools/livecd-creator]
bc72d2
    install()                                  [livecd-tools/imgcreate/creator.py]
bc72d2
      close()                                  [livecd-tools/imgcreate/dnfinst.py]
bc72d2
        close()                                [dnf/dnf/base.py]
bc72d2
          reset()                              [dnf/dnf/base.py]
bc72d2
            _sack = None
bc72d2
            _goal = None
bc72d2
            _transaction = None
bc72d2
              ...
bc72d2
                dnf_sack_finalize()            [libdnf/libdnf/dnf-sack.cpp]
bc72d2
                  pool_free()                  [libsolv/src/pool.c]
bc72d2
                    pool_freeallrepos()        [libsolv/src/pool.c]
bc72d2
                      repo_freedata()          [libsolv/src/repo.c]
bc72d2
                        repodata_freedata()    [libsolv/src/repodata.c]
bc72d2
                          repopagestore_free() [libsolv/src/repopage.c]
bc72d2
                            close()
bc72d2
bc72d2
Namely, when dnf.Base.reset() [dnf/dnf/base.py] is called with (sack=True,
bc72d2
goal=True), the reference counts of the objects pointed to by the "_sack",
bc72d2
"_goal" and "_transaction" fields are supposed to reach zero, and then, as
bc72d2
part of the DnfSack object's finalization, the libsolv file descriptors
bc72d2
are supposed to be closed.
bc72d2
bc72d2
Now, while this *may* happen immediately in dnf.Base.reset(), it may as
bc72d2
well not. The reason is that there is a multitude of *circular references*
bc72d2
between DnfSack and the packages that it contains. When dnf.Base.reset()
bc72d2
is entered, we have the following picture:
bc72d2
bc72d2
     _sack                   _goal
bc72d2
        |                      |
bc72d2
        v                      v
bc72d2
   +----------------+      +-------------+
bc72d2
   | DnfSack object | <--- | Goal object |
bc72d2
   +----------------+      +-------------+
bc72d2
     |^    |^    |^
bc72d2
     ||    ||    ||
bc72d2
     ||    ||    ||
bc72d2
  +--||----||----||---+
bc72d2
  |  v|    v|    v|   | <-- _transaction
bc72d2
  | Pkg1  Pkg2  PkgN  |
bc72d2
  |                   |
bc72d2
  | Transaction oject |
bc72d2
  +-------------------+
bc72d2
bc72d2
That is, the reference count of the DnfSack object is (1 + 1 + N), where N
bc72d2
is the number of packages in the transaction. Details:
bc72d2
bc72d2
(a) The first reference comes from the "_sack" field, established like
bc72d2
    this:
bc72d2
bc72d2
      main                       [livecd-tools/tools/livecd-creator]
bc72d2
        install()                [livecd-tools/imgcreate/creator.py]
bc72d2
          fill_sack()            [dnf/dnf/base.py]
bc72d2
            _build_sack()        [dnf/dnf/sack.py]
bc72d2
              Sack()
bc72d2
                sack_init()      [libdnf/python/hawkey/sack-py.cpp]
bc72d2
                  dnf_sack_new() [libdnf/libdnf/dnf-sack.cpp]
bc72d2
bc72d2
(b) The second reference on the DnfSack object comes from "_goal":
bc72d2
bc72d2
      main                        [livecd-tools/tools/livecd-creator]
bc72d2
        install()                 [livecd-tools/imgcreate/creator.py]
bc72d2
          fill_sack()             [dnf/dnf/base.py]
bc72d2
             _goal = Goal(_sack)
bc72d2
               goal_init()        [libdnf/python/hawkey/goal-py.cpp]
bc72d2
                 Py_INCREF(_sack)
bc72d2
bc72d2
(c) Then there is one reference to "_sack" *per package* in the
bc72d2
    transaction:
bc72d2
bc72d2
      main                                  [livecd-tools/tools/livecd-creator]
bc72d2
        install()                           [livecd-tools/imgcreate/creator.py]
bc72d2
          runInstall()                      [livecd-tools/imgcreate/dnfinst.py]
bc72d2
            resolve()                       [dnf/dnf/base.py]
bc72d2
              _goal2transaction()           [dnf/dnf/base.py]
bc72d2
                list_installs()             [libdnf/python/hawkey/goal-py.cpp]
bc72d2
                  list_generic()            [libdnf/python/hawkey/goal-py.cpp]
bc72d2
                    packagelist_to_pylist() [libdnf/python/hawkey/iutil-py.cpp]
bc72d2
                      new_package()         [libdnf/python/hawkey/sack-py.cpp]
bc72d2
                        Py_BuildValue()
bc72d2
                ts.add_install()
bc72d2
bc72d2
    list_installs() creates a list of packages that need to be installed
bc72d2
    by DNF. Inside the loop in packagelist_to_pylist(), which constructs
bc72d2
    the elements of that list, Py_BuildValue() is called with the "O"
bc72d2
    format specifier, and that increases the reference count on "_sack".
bc72d2
bc72d2
    Subsequently, in the _goal2transaction() method, we iterate over the
bc72d2
    package list created by list_installs(), and add each package to the
bc72d2
    transaction (ts.add_install()). After _goal2transaction() returns,
bc72d2
    this transaction is assigned to "self._transaction" in resolve(). This
bc72d2
    is where the last N (back-)references on the DnfSack object come from.
bc72d2
bc72d2
(d) Now, to quote the defintion of the DnfSack object
bc72d2
    ("libdnf/docs/hawkey/tutorial-py.rst"):
bc72d2
bc72d2
> *Sack* is an abstraction for a collection of packages.
bc72d2
bc72d2
    That's why the DnfSack object references all the Pkg1 through PkgN
bc72d2
    packages.
bc72d2
bc72d2
So, when the dnf.Base.reset() method completes, the picture changes like
bc72d2
this:
bc72d2
bc72d2
     _sack                     _goal
bc72d2
        |                        |
bc72d2
   -- [CUT] --              -- [CUT] --
bc72d2
        |                        |
bc72d2
        v                |       v
bc72d2
   +----------------+   [C]  +-------------+
bc72d2
   | DnfSack object | <-[U]- | Goal object |
bc72d2
   +----------------+   [T]  +-------------+
bc72d2
     |^    |^    |^      |
bc72d2
     ||    ||    ||
bc72d2
     ||    ||    ||         |
bc72d2
  +--||----||----||---+    [C]
bc72d2
  |  v|    v|    v|   | <--[U]-- _transaction
bc72d2
  | Pkg1  Pkg2  PkgN  |    [T]
bc72d2
  |                   |     |
bc72d2
  | Transaction oject |
bc72d2
  +-------------------+
bc72d2
bc72d2
and we are left with N reference cycles (one between each package and the
bc72d2
same DnfSack object).
bc72d2
bc72d2
This set of cycles can only be cleaned up by Python's generational garbage
bc72d2
collector <https://stackify.com/python-garbage-collection/>. The GC will
bc72d2
collect the DnfSack object, and consequently close the libsolv page file
bc72d2
descriptors via dnf_sack_finalize() -- but garbage collection will happen
bc72d2
*only eventually*, unpredictably.
bc72d2
bc72d2
This means that the dnf.Base.reset() method breaks its interface contract:
bc72d2
bc72d2
> Make the Base object forget about various things.
bc72d2
bc72d2
because the libsolv file descriptors can (and frequently do, in practice)
bc72d2
survive dnf.Base.reset().
bc72d2
bc72d2
In general, as long as the garbage collector only tracks process-private
bc72d2
memory blocks, there's nothing wrong; however, file descriptors are
bc72d2
visible to the kernel. When dnf.Base.reset() *temporarily* leaks file
bc72d2
descriptors as explained above, then immediately subsequent operations
bc72d2
that depend on those file descriptors having been closed, can fail.
bc72d2
bc72d2
An example is livecd-creator's unmounting of:
bc72d2
bc72d2
  /var/tmp/imgcreate-mytcghah/install_root/var/cache/dnf
bc72d2
bc72d2
which the kernel refuses, due to libsolv's still open file descriptors
bc72d2
pointing into that filesystem:
bc72d2
bc72d2
> umount: /var/tmp/imgcreate-mytcghah/install_root/var/cache/dnf: target
bc72d2
> is busy.
bc72d2
> Unable to unmount /var/tmp/imgcreate-mytcghah/install_root/var/cache/dnf
bc72d2
> normally, using lazy unmount
bc72d2
bc72d2
(Unfortunately, the whole lazy umount idea is misguided in livecd-tools;
bc72d2
it's a misfeature that should be removed, as it permits the corruption of
bc72d2
the loop-backed filesystem. Now that the real bug is being fixed in DNF,
bc72d2
lazy umount is not needed as a (broken) workaround in livecd-tools. But
bc72d2
that's a separate patch for livecd-tools:
bc72d2
<https://github.com/livecd-tools/livecd-tools/pull/227>.)
bc72d2
bc72d2
Plug the fd leak by forcing a garbage collection in dnf.Base.reset()
bc72d2
whenever we cut the "_sack", "_goal" and "_transaction" links -- that is,
bc72d2
when the "sack" and "goal" parameters are True.
bc72d2
bc72d2
Note that precisely due to the unpredictable behavior of the garbage
bc72d2
collector, reproducing the bug may prove elusive. In order to reproduce it
bc72d2
deterministically, through usage with livecd-creator, disabling automatic
bc72d2
garbage collection with the following patch (for livecd-tools) is
bc72d2
sufficient:
bc72d2
bc72d2
> diff --git a/tools/livecd-creator b/tools/livecd-creator
bc72d2
> index 291de10cbbf9..8d2c740c238b 100755
bc72d2
> --- a/tools/livecd-creator
bc72d2
> +++ b/tools/livecd-creator
bc72d2
> @@ -31,6 +31,8 @@ from dnf.exceptions import Error as DnfBaseError
bc72d2
>  import imgcreate
bc72d2
>  from imgcreate.errors import KickstartError
bc72d2
>
bc72d2
> +import gc
bc72d2
> +
bc72d2
>  class Usage(Exception):
bc72d2
>      def __init__(self, msg = None, no_error = False):
bc72d2
>          Exception.__init__(self, msg, no_error)
bc72d2
> @@ -261,5 +263,6 @@ def do_nss_libs_hack():
bc72d2
>      return hack
bc72d2
>
bc72d2
>  if __name__ == "__main__":
bc72d2
> +    gc.disable()
bc72d2
>      hack = do_nss_libs_hack()
bc72d2
>      sys.exit(main())
bc72d2
bc72d2
Also note that you need to use livecd-tools at git commit 4afde9352e82 or
bc72d2
later, for this fix to make any difference: said commit fixes a different
bc72d2
(independent) bug in livecd-tools that produces identical symptoms, but
bc72d2
from a different origin. In other words, if you don't have commit
bc72d2
4afde9352e82 in your livecd-tools install, then said bug in livecd-tools
bc72d2
will mask this DNF fix.
bc72d2
bc72d2
Signed-off-by: Laszlo Ersek <lersek@redhat.com>
bc72d2
(cherry picked from commit 5ce5ed1ea08ad6e198c1c1642c4d9ea2db6eab86)
bc72d2
---
bc72d2
 dnf/base.py | 41 +++++++++++++++++++++++++++++++++++++++++
bc72d2
 1 file changed, 41 insertions(+)
bc72d2
bc72d2
diff --git a/dnf/base.py b/dnf/base.py
bc72d2
index babca31d..852fcdd8 100644
bc72d2
--- a/dnf/base.py
bc72d2
+++ b/dnf/base.py
bc72d2
@@ -72,6 +72,7 @@ import dnf.transaction
bc72d2
 import dnf.util
bc72d2
 import dnf.yum.rpmtrans
bc72d2
 import functools
bc72d2
+import gc
bc72d2
 import hawkey
bc72d2
 import itertools
bc72d2
 import logging
bc72d2
@@ -568,6 +569,46 @@ class Base(object):
bc72d2
             self._comps_trans = dnf.comps.TransactionBunch()
bc72d2
             self._transaction = None
bc72d2
         self._update_security_filters = []
bc72d2
+        if sack and goal:
bc72d2
+            # We've just done this, above:
bc72d2
+            #
bc72d2
+            #      _sack                     _goal
bc72d2
+            #         |                        |
bc72d2
+            #    -- [CUT] --              -- [CUT] --
bc72d2
+            #         |                        |
bc72d2
+            #         v                |       v
bc72d2
+            #    +----------------+   [C]  +-------------+
bc72d2
+            #    | DnfSack object | <-[U]- | Goal object |
bc72d2
+            #    +----------------+   [T]  +-------------+
bc72d2
+            #      |^    |^    |^      |
bc72d2
+            #      ||    ||    ||
bc72d2
+            #      ||    ||    ||         |
bc72d2
+            #   +--||----||----||---+    [C]
bc72d2
+            #   |  v|    v|    v|   | <--[U]-- _transaction
bc72d2
+            #   | Pkg1  Pkg2  PkgN  |    [T]
bc72d2
+            #   |                   |     |
bc72d2
+            #   | Transaction oject |
bc72d2
+            #   +-------------------+
bc72d2
+            #
bc72d2
+            # At this point, the DnfSack object would be released only
bc72d2
+            # eventually, by Python's generational garbage collector, due to the
bc72d2
+            # cyclic references DnfSack<->Pkg1 ... DnfSack<->PkgN.
bc72d2
+            #
bc72d2
+            # The delayed release is a problem: the DnfSack object may
bc72d2
+            # (indirectly) own "page file" file descriptors in libsolv, via
bc72d2
+            # libdnf. For example,
bc72d2
+            #
bc72d2
+            #   sack->priv->pool->repos[1]->repodata[1]->store.pagefd = 7
bc72d2
+            #   sack->priv->pool->repos[1]->repodata[2]->store.pagefd = 8
bc72d2
+            #
bc72d2
+            # These file descriptors are closed when the DnfSack object is
bc72d2
+            # eventually released, that is, when dnf_sack_finalize() (in libdnf)
bc72d2
+            # calls pool_free() (in libsolv).
bc72d2
+            #
bc72d2
+            # We need that to happen right now, as callers may want to unmount
bc72d2
+            # the filesystems which those file descriptors refer to immediately
bc72d2
+            # after reset() returns. Therefore, force a garbage collection here.
bc72d2
+            gc.collect()
bc72d2
 
bc72d2
     def _closeRpmDB(self):
bc72d2
         """Closes down the instances of rpmdb that could be open."""
bc72d2
-- 
bc72d2
2.35.1
bc72d2