Why Are Packages Installed Rather Than Just Linked to a Specific Environment

Conda already does this. However, because it leverages hardlinks, it is easy to overestimate the space really being used, especially if one only looks at the size of a single env at a time.

To illustrate, let's use du to inspect the real disk usage. First, if I measure each environment directory in a separate du call, I get the uncorrected per-environment usage:

$ for d in envs/*; do du -sh "$d"; done
2.4G envs/pymc36
1.7G envs/pymc3_27
1.4G envs/r-keras
1.7G envs/stan
1.2G envs/velocyto

which is what it might look like from a GUI.

Instead, if I let du count them together in a single call (du counts a hardlinked file only the first time it sees it, so this corrects for the hardlinks), I get:

$ du -sh envs/*
2.4G envs/pymc36
326M envs/pymc3_27
820M envs/r-keras
927M envs/stan
548M envs/velocyto

One can see that a significant amount of space is already being saved here.

Most of the hardlinks go back to the pkgs directory, so if we include that as well:

$ du -sh pkgs envs/*
8.2G pkgs
400M envs/pymc36
116M envs/pymc3_27
92M envs/r-keras
62M envs/stan
162M envs/velocyto

one can see that, outside of the shared packages, the envs are fairly light. If you're concerned about the size of my pkgs, note that I have never run conda clean on this system, so my pkgs directory is full of tarballs and superseded packages, plus some infrastructure I keep in base (e.g., Jupyter and Git).
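The correction du applies can be sketched in a few lines of Python. This is only an illustration, not how du or Conda is implemented: tree_usage is a hypothetical helper, and it sums apparent file sizes (st_size) rather than allocated disk blocks, so its numbers will not match du exactly. The idea is that hardlinked files share a (device, inode) pair, so remembering the pairs already seen lets each physical file be counted once.

```python
import os


def tree_usage(root, seen=None):
    """Sum file sizes under root, counting each (device, inode) pair once.

    `seen` is a set shared across calls: files already counted in an
    earlier tree are skipped, similar to passing several directories
    to a single `du` invocation.
    """
    if seen is None:
        seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat: describe the entry itself, do not follow symlinks
            st = os.lstat(os.path.join(dirpath, name))
            key = (st.st_dev, st.st_ino)
            if key in seen:
                continue  # another hardlink to a file we already counted
            seen.add(key)
            total += st.st_size
    return total
```

Calling tree_usage separately for each directory corresponds to the uncorrected listing, while passing one shared set (for example, seen = set(), then tree_usage("pkgs", seen) followed by tree_usage("envs/pymc36", seen)) corresponds to the corrected one.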

How does conda know which packages are unused?

Conda counts hardlinks

Conda uses hardlinks to minimize physical disk usage. That is, a single physical copy of lib/libz.a may be referenced from the package cache (where it was first unpacked), and then in multiple environments.

Conda determines eligibility for removing a package from the package cache by counting the number of hardlinks for the files in each package. Hardlink counts are tracked by the filesystem, not by Conda. An outline of the relevant code is:

# keep a list of packages to remove
pkgs_to_remove = []

# look in all package caches (there can be multiple)
for pkg_cache in pkgs_dirs:

    # check all packages
    for pkg in pkg_cache:

        # assume removable unless...
        remove_pkg = True
        for file in pkg:
            # is there evidence that it is linked elsewhere?
            if num_links(file) > 1:
                # if so, don't remove, and move on
                remove_pkg = False
                break

        # add it to the list if removable
        if remove_pkg:
            pkgs_to_remove.append(pkg)

# output some info on `pkgs_to_remove`
# check if user wants to execute removal

That is, if any file in a package has more than one link, then Conda will conclude it is used in another environment, and move on to the next package.
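For readers who want to try the idea on their own machine, here is a minimal runnable sketch of the same logic using only the standard library. It is not Conda's actual code: the function names are hypothetical, it assumes every subdirectory of a cache directory is one extracted package, and it only reports candidates without deleting anything.

```python
import os


def num_links(path):
    """Hard-link count for path, as reported by the filesystem."""
    return os.lstat(path).st_nlink


def package_is_unused(pkg_dir):
    """True if no regular file under pkg_dir has more than one hard link."""
    for dirpath, _dirnames, filenames in os.walk(pkg_dir):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # a symlink's own count says nothing about usage
            if num_links(path) > 1:
                return False  # linked elsewhere, so assume it is in use
    return True


def find_removable(pkgs_dirs):
    """List extracted package directories that look unused.

    Assumption: each subdirectory of a cache is one extracted package;
    plain files in the cache (e.g. tarballs) are ignored.
    """
    removable = []
    for cache in pkgs_dirs:
        for entry in sorted(os.scandir(cache), key=lambda e: e.name):
            if entry.is_dir(follow_symlinks=False) and package_is_unused(entry.path):
                removable.append(entry.path)
    return removable
```

Note that a link count above one only shows that some other hardlink exists somewhere on the filesystem; the sketch, like the outline above, treats that as evidence of use.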

Note that a file's link count only covers hardlinks: filesystems don't keep track of symbolic links (a.k.a. symlinks or softlinks) that point at a file, and Conda doesn't track them either. This is why Conda warns about cleaning packages in combination with the allow_softlinks setting.
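The difference is easy to observe directly. The following sketch (link_counts_demo is a hypothetical helper, and it assumes a POSIX-style filesystem where both os.link and os.symlink are permitted) creates a file in a temporary directory and records its link count before and after adding each kind of link.

```python
import os
import tempfile


def link_counts_demo():
    """Record a file's hard-link count as a symlink and a hardlink are added."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "libz.a")
        with open(original, "w") as fh:
            fh.write("data")

        counts = {"initial": os.stat(original).st_nlink}

        # A symlink is a separate small file that stores a path;
        # the target's link count is not touched.
        os.symlink(original, os.path.join(tmp, "soft"))
        counts["after_symlink"] = os.stat(original).st_nlink

        # A hardlink is a second directory entry for the same inode,
        # so the filesystem increments the count.
        os.link(original, os.path.join(tmp, "hard"))
        counts["after_hardlink"] = os.stat(original).st_nlink

        return counts
```

On such a filesystem the count stays at 1 after the symlink and becomes 2 after the hardlink, which is why a package that is only soft-linked into environments looks unused to a check based on link counts.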

Conda: Choose where packages are downloaded for each environment

A fundamental concept of conda is that packages get downloaded and extracted to a shared cache, from which they are selectively linked into the different conda environments. You want to work against this concept, so whatever you do will be hacky and have repercussions.

You could install a separate Miniconda for each of your projects, and (try to) make sure that they don't know about each other by removing all conda-related files and environment settings from your home directory, or even by using a different HOME for each project. Before working with a project, you'd have to put the correct conda on the PATH.

Or you could install Miniconda on a dedicated drive separate from the one holding your home directory, and put the conda environments inside your home directory. Because hardlinks cannot span filesystems, that would prevent conda from hard-linking the files. It would still download the packages into the shared cache, but then copy only the relevant files into each of your projects. Of course, copying is slower than hard-linking and uses more disk space.
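Whether two locations can share hardlinks at all can be checked by comparing their device IDs. The sketch below is an illustration only, with hypothetical helper names, and is not Conda's implementation: same_filesystem compares st_dev for two existing paths, and link_or_copy mimics the link-else-copy behaviour described above by trying os.link first and falling back to a copy if the filesystem refuses.

```python
import os
import shutil


def same_filesystem(path_a, path_b):
    """True if both existing paths live on the same device (filesystem)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def link_or_copy(src, dst):
    """Try to hardlink src to dst; fall back to copying if that fails.

    Returns "hardlink" or "copy" so the caller can see what happened.
    """
    try:
        os.link(src, dst)
        return "hardlink"
    except FileExistsError:
        raise  # do not silently overwrite an existing destination
    except OSError:
        # e.g. src and dst are on different filesystems, or the
        # filesystem does not support hardlinks
        shutil.copy2(src, dst)
        return "copy"
```

For example, same_filesystem(cache_dir, env_dir) returning False for your package cache and environment directory (both names are placeholders here) suggests that files would have to be copied rather than linked.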

Specifying the package directory per environment rather than per conda installation is not possible, as darthbith has already pointed out in a comment.

Why is a cloned conda environment taking so much space?

Actually, Conda already shares space between environments. However, because it leverages hardlinks, it is easy to overestimate the space really being used.

In any case, the answer to your question might lie in the difference between Anaconda and Miniconda: Anaconda is about 2GB, while Miniconda is closer to 100MB.

Anaconda includes a long list of packages that come preinstalled in its base environment, so an environment cloned from base (or created with the anaconda metapackage) carries all of them along.

Miniconda starts from a bare-bones base environment that contains few packages. Switching to Miniconda should substantially reduce the size and number of packages in your environments.

Conda also uses hardlinks for packages installed via conda install. Hardlinks let a single copy of a file on disk appear in several environments, which is how shared dependencies avoid being duplicated. Packages installed via pip are not hardlinked, so they cannot take advantage of the space savings that conda packages offer.


