Including data in Python packages

Austin Bingham from Good With Computers

Every time I need to include data in a Python package, I find myself going in circles checking existing projects, blog posts, and every other resource I can find to figure out the right way to do it. For something so seemingly straightforward, including data in a package always turns into a bit of a mess for me.

I had to make a package today that contained data, so (since it involved the standard running in circles for an hour) I thought I'd take the time to write down how I finally got it to work.

What is "package data"?

Broadly, package data is any file that you want to include with your Python package that isn't a Python source file. An example is a default TOML configuration file that you want to be able to produce for users. It's not Python source code, so it wouldn't normally be included in a Python package. But with just a small amount of work, you can include it in a package and make it available programmatically to users of your package (or to your package itself).

The short version

  1. Set include_package_data to True in your setup.py.
  2. Set package_dir in your setup.py.
  3. Include a MANIFEST.in that references your data files.

If that doesn't mean anything to you, read on.

The longer version

Suppose you have a project structure like this:

setup.py
source/
    project/
        __init__.py
        data/
            default_config.toml

It's a fairly standard structure, with the source directory containing the actual package files. The name of the package in this case is project.

What stands out is the data/default_config.toml file under project. This is our package data. That is, it's a non-Python file that we want to include in our package. Normally setuptools won't include it in the distributions you build (e.g. wheels), so we need to tell setuptools about it.

Create a MANIFEST.in

The first step is to create a new file, MANIFEST.in, as a sibling to setup.py. This file lets us specify the files that should be included in our distributions (beyond the files that are included by default). You can read more about it in the Python Packaging User Guide.

At its simplest (which works for me most of the time), it just needs to specify that your package should include anything and everything under some directory. In our case, we can include everything under source/project/data like this:

recursive-include source/project/data *

That's it. You can, of course, have much more complex include/exclude specs in MANIFEST.in, but this will get you started.

Update setup.py

You also need to modify setup.py to make sure it will let you include package data. Fortunately, in the normal case, this is very simple:

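from setuptools import setup

setup(
    ...
    # Include the data files matched by MANIFEST.in
    include_package_data=True,
    # Packages live under the source/ directory
    package_dir={"": "source"},
    ...
)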

Now when you install your package from source or generate wheels for distribution, everything in the data directory will be included in your package.
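If you want to confirm that the data really made it into a wheel, a quick check like the one below can help. This is only a sketch: it relies on wheels being ordinary zip archives, and the function name data_files_in_wheel and the wheel filename in the usage comment are my own inventions for illustration, not anything setuptools provides.

```python
import zipfile


def data_files_in_wheel(wheel_path, prefix="project/data/"):
    """Return the files inside a wheel whose archive path starts with `prefix`.

    A wheel is a zip archive, so the standard zipfile module can list it.
    The default prefix assumes the example layout from this post.
    """
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(
            name
            for name in wheel.namelist()
            # Skip directory entries; keep only real files under the prefix.
            if name.startswith(prefix) and not name.endswith("/")
        )


# Hypothetical usage, after building a wheel into dist/:
#   print(data_files_in_wheel("dist/project-0.1.0-py3-none-any.whl"))
```

An empty list here usually means that MANIFEST.in or include_package_data isn't being picked up.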

Accessing the package data

Including the package data is only half the battle, though. You still need some way to access the files from your program. This is where pkg_resources comes in. pkg_resources lets you (among other things) get paths to the directories and files in your package data. I won't go into great detail here, but here's how you could get the path to the data directory at runtime:

import pkg_resources
pkg_resources.resource_filename("project", "data")

Or you could get a readable stream to the default_config.toml file:

stream = pkg_resources.resource_stream("project", "data/default_config.toml")
stream.read()

The pkg_resources docs linked above are excellent, so I'll leave it at that.
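One aside worth noting: newer setuptools releases mark pkg_resources as deprecated, and the standard library's importlib.resources covers the same ground (its files() function is available from Python 3.9). Here is a minimal sketch of reading the same file that way, assuming the project layout above. The helper name read_package_text is hypothetical, not part of any library.

```python
from importlib.resources import files


def read_package_text(package, *path_parts):
    """Read a text resource bundled inside an importable package.

    `package` is the package name (e.g. "project") and `path_parts` are the
    path components below it (e.g. "data", "default_config.toml").
    """
    resource = files(package)
    # Walk down one component at a time using the "/" operator.
    for part in path_parts:
        resource = resource / part
    return resource.read_text(encoding="utf-8")


# Hypothetical usage with the example package from this post:
#   text = read_package_text("project", "data", "default_config.toml")
```

Like resource_stream, this goes through the import system rather than assuming where the package is installed on disk.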

What did I get wrong or leave out?

There are much more sophisticated ways to use pkg_resources and package data, but I find that what I've described above works well for most of what I need. If I got things wrong or left out important details, let me know!
