Suppose you've written some Python code that you want to share. Other users will have to get your code and perform some setup operations, including making their Python environment aware of your package so they can `import` it. Ideally, you'd also communicate which other modules your module requires, so that users can make sure they have all of the requirements before they try to use your module. And when you make improvements to your code, you'd like your users to be able to get those changes as effortlessly as possible, preferably without having to go through the installation steps again.
These code distribution challenges are difficult to manage manually, so developers have built systems designed to automate code distribution. These systems are called package managers. The main package managers for Python are Pip and Anaconda. Pip is a general-purpose Python installer that installs packages from the Python Package Index (PyPI). Anaconda is more geared toward data science, and it installs packages from its own collection, called Anaconda Repository.
We recommend installing Anaconda and using it to manage your packages. Anaconda has a few important advantages over Pip:
- Anaconda ensures that the requirements of all installed packages remain satisfied. When Pip installs a package, it updates your environment based only on that package's requirements. Such an update might break previously installed packages, since they might depend on a different version of the same dependency.
- Anaconda provides built-in support for managing multiple virtual environments. If Package A and Package B require incompatible versions of Package C, you can set up one virtual environment with Package A and one version of Package C, and a second virtual environment with Package B and the other version of Package C.
- For packages that depend on compiled code, Anaconda directly installs binaries. This means that these dependencies are built by the package maintainers and sent to you ready to run. If the build process instead happens on your computer, there are more opportunities for things to go wrong during installation.
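To see why resolving requirements across the whole environment matters, here is a toy sketch of the constraint a dependency resolver must satisfy: a single installed version of each package that meets every requirement simultaneously. All package names and version pins below are invented for illustration.

```python
# Toy sketch of the dependency-resolution constraint (all names and
# version pins here are invented for illustration).
requirements = {
    "package_a": {"package_c": "1.0"},  # Package A pins C to 1.0
    "package_b": {"package_c": "2.0"},  # Package B pins C to 2.0
}

def satisfied(installed):
    """True if the installed versions meet every package's pins."""
    return all(
        installed.get(dep) == pin
        for pins in requirements.values()
        for dep, pin in pins.items()
    )

# No single version of package_c can satisfy both pins, which is
# exactly the situation that separate virtual environments solve.
print(satisfied({"package_c": "1.0"}))  # False: B's pin is violated
print(satisfied({"package_c": "2.0"}))  # False: A's pin is violated
```

Because no assignment of a single `package_c` version satisfies both pins, the only way to use Package A and Package B on one machine is to give each its own environment.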
Some packages are available on PyPI but not in Anaconda Repository, and in these cases we recommend that you use Pip.
Conda's approach does come with trade-offs, and virtual environments in particular are worth a closer look:

- Because Conda is doing extra computation to ensure all required dependencies are met, it often takes longer than Pip to install a new package.
- Separate virtual environments can be used to manage incompatible dependencies between two projects.
Your Python environment is the set of packages you have available to import in a Python session. For example, a user's Python environment often includes all of the packages installed on the computer. A virtual environment emulates such an environment by exposing specific packages (and specific versions of those packages) to the Python interpreter. Virtual environments are useful because they let the user quickly switch between different sets of available packages. They also make it possible to pin down exactly which packages a given application needs, and to share that information so that others can reproduce the environment without interfering with the other environments on their machines.
For example, if you need NumPy 1.16.3 for one project and NumPy 1.16.4 for a different project, your package manager can install both versions and just change which one is used when you execute `import numpy`. This is much more convenient than uninstalling one version and installing the other every time you need to switch between the two projects.
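To check which interpreter and which NumPy a session is actually using, the Python standard library can report both. A minimal sketch (pure stdlib, and it works whether or not NumPy is installed):

```python
import sys
from importlib import metadata

# Each conda environment has its own directory tree; sys.executable
# and sys.prefix reveal which one the running interpreter belongs to.
interpreter = sys.executable
prefix = sys.prefix

# importlib.metadata reports the installed version of a package, so
# you can confirm which NumPy the active environment exposes.
try:
    numpy_version = metadata.version("numpy")
except metadata.PackageNotFoundError:
    numpy_version = None  # NumPy is not installed in this environment

print("interpreter:", interpreter)
print("environment prefix:", prefix)
print("numpy version:", numpy_version)
```

Running this in two different environments shows different prefixes, and potentially different NumPy versions, without any change to the script itself.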
To use conda virtual environments, we first have to set up conda to work with our shell. This requires restarting bash.

```bash
conda init bash
conda config --set changeps1 False
exit
```
The second line configures conda to refrain from its default behavior of printing the name of the current environment every time a command is run from the command line. You might find this setting preferable on your own computer, but it will be essential for us as we execute the bash cells in this section.
To create a new Anaconda virtual environment, use `conda create`. To activate an environment, use `conda activate`. (Note: this cell takes a few dozen seconds to run, and it prints quite a bit of text. The `--yes` argument automatically answers "yes" when conda asks us whether we want to proceed.)

```bash
conda create -n myenv python numpy=1.16.4 --yes
conda activate myenv
```
We can check that our newly activated environment has NumPy but not Pandas: the script below prints the NumPy version, then fails on `import pandas`.

```bash
echo "import numpy" > tmp.py
echo "print(numpy.version.version)" >> tmp.py
echo "import pandas" >> tmp.py
python tmp.py
```
We can view all of the environments we've set up with `conda env list`:

```bash
conda env list
```
Conda installation operations modify the current environment. For example, we can add pandas:

```bash
conda install pandas --yes
python tmp.py
```
We can get a readable description of the current environment using `conda env export`:

```bash
conda env export
```
The output of this command can be saved to a file (customarily called `environment.yml`) which can be used by others to replicate the environment. Just for practice, let's save the environment to a file, remove the environment, and then re-create it from the saved file:

```bash
conda env export > environment.yml
conda remove -n myenv --all
conda env create -f environment.yml
```
Note that we used the `-f` argument to make `conda env create` get the package list from the `environment.yml` file rather than directly from the command line.
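To get a feel for what `environment.yml` contains, here is a minimal, line-based sketch that extracts package names from the `dependencies:` section. The sample text mimics typical `conda env export` output; real files are full YAML (sometimes with a nested pip section) and are best read with a YAML parser such as PyYAML.

```python
# Sample content in the typical shape conda env export produces.
sample_yaml = """\
name: myenv
channels:
  - defaults
dependencies:
  - numpy=1.16.4
  - pandas=0.24.2
  - python=3.7.3
"""

def dependency_names(text):
    """Collect the package names listed under 'dependencies:'.

    Minimal line-based sketch for illustration only; a real parser
    should use a YAML library such as PyYAML.
    """
    deps, in_deps = [], False
    for line in text.splitlines():
        if line.strip() == "dependencies:":
            in_deps = True
        elif in_deps and line.startswith("  - "):
            # Entries look like "numpy=1.16.4"; keep only the name.
            deps.append(line.strip("- ").split("=")[0])
        elif in_deps and not line.startswith("  "):
            in_deps = False
    return deps

print(dependency_names(sample_yaml))  # → ['numpy', 'pandas', 'python']
```

The version pins after each `=` are what allow conda to rebuild the environment exactly as it was exported.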
If you want a colleague to be able to reproduce the Python environment you used in a particular project, one convenient way to do that is to give them your `environment.yml` file.
Other reproducibility solutions
We will close this section by mentioning two other solutions to the reproducibility problem. If you're working with a non-Conda Python installation, you can use Pip together with virtualenv to reproduce the virtual environment functionality of Conda. You can also get Pip to give you a list of the packages and versions available in the local virtual environment using `pip freeze`.
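For a sense of what such a listing contains, the same information is also available from inside Python via `importlib.metadata`. This sketch rebuilds a freeze-style `name==version` listing from the metadata of the installed packages:

```python
from importlib import metadata

# Build "name==version" lines from every package installed in the
# current environment, similar to the listing pip produces.
lines = sorted(
    f"{name}=={dist.version}"
    for dist in metadata.distributions()
    if (name := dist.metadata["Name"]) is not None
)

# Show a few entries; the full list describes the whole environment.
print("\n".join(lines[:5]))
```

Saving such a listing to a file plays the same role for Pip-managed environments that `environment.yml` plays for Conda.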
A much more general-purpose tool for achieving reproducibility is Docker. We'll discuss Docker more extensively in the final section in this course.