Including data in Python packages

Austin Bingham from Good With Computers

Every time I need to include data in a Python package, I find myself going in circles checking existing projects, blog posts, and every other resource I can find to figure out the right way to do it. For something so seemingly straightforward, including data in a package always turns into a bit of a mess for me.

I had to make a package today that contained data, so - since it involved the standard running in circles for an hour - I thought I'd take the time to write down how I finally got it to work.

What is "package data"?

Broadly, package data is any files that you want to include with your Python package that aren't Python source files. An example is a TOML default configuration file that you want to be able to produce for users. It's not Python source code, so it wouldn't normally be included in a Python package. But with just a small amount of work, you can include it in a package and make it available programatically to users of your package (or your package itself).

The short version

  1. Set include_package_data to True in your setup.py.
  2. Set package_dir in your setup.py.
  3. Include a MANIFEST.in that references your data files.

If that doesn't mean anything to you, read on.

The longer version

Suppose you have a project structure like this:

setup.py
source/
    project/
        __init__.py
        data/
            default_config.toml

It's a fairly standard structure, with the source directory containing the actual package files. The name of the package in this case is project.

What stands out is the data/default_config.toml file under project. This is our package data. That is, it's a non-Python file that we want to include in our package. Normally setuptools won't include it in the distributions you build (e.g. wheels, etc.), so we need to tell setuptools about it.

Create a MANIFEST.in

The first step is to create a new file, MANIFEST.in, as a sibling to setup.py. This file lets us specify the files that should be included in our distributions (beyond the files that are included by default). You can read more about it in the Python Packaging User guide.

At it's simplest (which works for me most of the time), it just needs to specify that your package should include anything and everything under some directory. In our case, we can include everything under source/project/data like this:

recursive-include source/project/data *

That's it. You can, of course, have much more complex include/exclude specs in MANIFEST.in, but this will get you started.

Update setup.py

You also need to modify setup.py to make sure it will let you include package data. Fortunately, in the normal case, this is very simple:

setup(
    ...
    include_package_data=True,
    package_dir={"": "source"},
    ...
)

Now when you install your package from source or generate wheels for distribution, everything in the data directory will be included in your package.

Accessing the package data

Including the package data is only half of the battle, though. You still need some way to access the files from your program. This is where pkg_resource comes in. pkg_resources lets you (among other things) get paths to the directories and files in your package data. I won't go into great detail here, but here's how you could get the path to the data directory at runtime:

pkg_resources.resource_filename("project", "data")

Or you could get a readable stream to the default_config.toml file:

stream = pkg_resources.resource_stream("project", "data/default_config.toml")
stream.read()

The pkg_resource docs linked above are excellent, so I'll leave it at that.

What did I get wrong or leave out?

There are much more sophisticated ways to use pkg_utils and package data, but I find that what I've described above seems to work well for most of what I need. If I got things wrong or left out important details, let me know!

Including data in Python packages

Austin Bingham from Good With Computers

Every time I need to include data in a Python package, I find myself going in circles checking existing projects, blog posts, and every other resource I can find to figure out the right way to do it. For something so seemingly straightforward, including data in a package always turns into a bit of a mess for me.

I had to make a package today that contained data, so - since it involved the standard running in circles for an hour - I thought I'd take the time to write down how I finally got it to work.

What is "package data"?

Broadly, package data is any files that you want to include with your Python package that aren't Python source files. An example is a TOML default configuration file that you want to be able to produce for users. It's not Python source code, so it wouldn't normally be included in a Python package. But with just a small amount of work, you can include it in a package and make it available programatically to users of your package (or your package itself).

The short version

  1. Set include_package_data to True in your setup.py.
  2. Set package_dir in your setup.py.
  3. Include a MANIFEST.in that references your data files.

If that doesn't mean anything to you, read on.

The longer version

Suppose you have a project structure like this:

setup.py
source/
    project/
        __init__.py
        data/
            default_config.toml

It's a fairly standard structure, with the source directory containing the actual package files. The name of the package in this case is project.

What stands out is the data/default_config.toml file under project. This is our package data. That is, it's a non-Python file that we want to include in our package. Normally setuptools won't include it in the distributions you build (e.g. wheels, etc.), so we need to tell setuptools about it.

Create a MANIFEST.in

The first step is to create a new file, MANIFEST.in, as a sibling to setup.py. This file lets us specify the files that should be included in our distributions (beyond the files that are included by default). You can read more about it in the Python Packaging User guide.

At it's simplest (which works for me most of the time), it just needs to specify that your package should include anything and everything under some directory. In our case, we can include everything under source/project/data like this:

recursive-include source/project/data *

That's it. You can, of course, have much more complex include/exclude specs in MANIFEST.in, but this will get you started.

Update setup.py

You also need to modify setup.py to make sure it will let you include package data. Fortunately, in the normal case, this is very simple:

setup(
    ...
    include_package_data=True,
    package_dir={"": "source"},
    ...
)

Now when you install your package from source or generate wheels for distribution, everything in the data directory will be included in your package.

Accessing the package data

Including the package data is only half of the battle, though. You still need some way to access the files from your program. This is where pkg_resource comes in. pkg_resources lets you (among other things) get paths to the directories and files in your package data. I won't go into great detail here, but here's how you could get the path to the data directory at runtime:

pkg_resources.resource_filename("project", "data")

Or you could get a readable stream to the default_config.toml file:

stream = pkg_resources.resource_stream("project", "data/default_config.toml")
stream.read()

The pkg_resource docs linked above are excellent, so I'll leave it at that.

What did I get wrong or leave out?

There are much more sophisticated ways to use pkg_utils and package data, but I find that what I've described above seems to work well for most of what I need. If I got things wrong or left out important details, let me know!

Suspending the computer using Kupfer

Andy Balaam from Andy Balaam's Blog

I have recently started using Kupfer again as my application launcher in Ubuntu MATE, and I found it lacked the ability to suspend the computer.

Here is the plugin I wrote to support this.

To install it, quit Kupfer, create a directory in your home dir called .local/share/kupfer/plugins, and create this file suspend.py inside:

__kupfer_name__ = _("Power management")
__kupfer_sources__ = ("PowerManagementItemsSource", )
__description__ = _("Actions to suspend the computer")
__version__ = "2021-05-05"
__author__ = "Andy Balaam "


from kupfer.plugin import session_support as support


class Suspend (support.CommandLeaf):
    def __init__(self, commands):
        support.CommandLeaf.__init__(self, commands, "Suspend")
    def get_description(self):
        return _("Suspend the computer")
    def get_icon_name(self):
        return "system-suspend"


class PowerManagementItemsSource (support.CommonSource):
	def __init__(self):
		support.CommonSource.__init__(self, _("Power management"))
	def get_items(self):
		return (Suspend((["systemctl", "suspend"],)),)

# Copyright 2021 Andy Balaam, released under the MIT license.

Now restart Kupfer, go to Preferences, Plugins, and tick “Power management”.

You should now see a “Suspend” item if you search for it in the Kupfer interface.

Inspired by: Mate Session Management – Kupfer Plugin.

Reference docs: Kupfer Plugin API

Python virtual environments with pyenv on Apple Silicon

Ekaterina Nikonova from Good With Computers

Apple's recent transition to the new architecture for its Mac computers has caused rather predictable problems for developers whose workflow depends on certain versions of pre-compiled libraries for x86 architecture. While the latest releases of Python come with a universal installer that allows to build universal binaries for M1 systems, those who prefer to manage Python environments with pyenv, may find it difficult to choose the correct version for installation.

This problem can be solved by installing both x86 and arm64 Python executables. To do that, we need to be able to run pyenv in x86 mode and make sure that all system dependencies are met for both architectures. In other words, we'll need both x86 and arm64 Homebrew packages that we'll keep separate using two installations of Homebrew.

First of all, to be able to run x86 executables, we'll need to install Rosetta:

$ softwareupdate —install-rosetta

Now we can install the x86 Homebrew:

$ arch -x86_64 /bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install.sh)"

It will be installed in the /usr/local/bin/ directory. For convenience, you can create an alias by adding the following line in your shell configuration file:

alias brew86="arch -x86_64 /usr/local/bin/brew"

Now we can invoke the x86 Homebrew as brew86 and install packages required by pyenv:

$ brew install openssl readline sqlite3 xz zlib

$ brew86 install openssl readline sqlite3 xz zlib

You can check whether the installation was successful and you have packages for both architectures using the file command, for example:

$ file /opt/homebrew/Cellar/openssl@1.1/1.1.1k/bin/openssl
/opt/homebrew/Cellar/openssl@1.1/1.1.1k/bin/openssl: Mach-O 64-bit executable arm64

$ file /usr/local/Cellar/openssl@1.1/1.1.1k/bin/openssl
/usr/local/Cellar/openssl@1.1/1.1.1k/bin/openssl: Mach-O 64-bit executable x86_64

To install x86 Python, you'll need to call pyenv with the arch -x86_64 prefix. For convenience, let's create an alias for this command by adding the following line in the shell config file:

alias pyenv86="arch -x86_64 pyenv"

Now you can install x86 Python binaries by calling:

$ pyenv86 install <PYTHON_VERSION>

By default, pyenv doesn't allow you to specify custom names for the installed Python versions, but you can use the pyenv-alias plugin to give your installations appropriate names:

$ VERSION_ALIAS="3.x.x_x86" pyenv86 install 3.x.x

Note that with aliases for your pyenv and Homebrew installations, you’ll have to specify them in all commands and locations, for example:

$ CFLAGS="-I$(brew86 --prefix openssl)/include" \
LDFLAGS="-L$(brew86 --prefix openssl)/lib" \
VERSION_ALIAS="3.x.x_x86" \
pyenv86 install -v 3.x.x

Toggle window decorations on Linux GTK3 with Python3

Andy Balaam from Andy Balaam&#039;s Blog

The Internet is full of outdated Python code for doing things with windows, so here is what I got working today in a Python 3, GTK 3 environment.

This script toggles the window decorations on the active window on and off. I have it bound to Ctrl+NumPadMinus for easy access.

#!/usr/bin/env python3

import gi
gi.require_version('Gdk', '3.0')
gi.require_version('GdkX11', '3.0')
gi.require_version('Wnck', '3.0')
from gi.repository import Gdk
from gi.repository import GdkX11
from gi.repository import Wnck


def active_window(screen):
    for window in screen.get_windows():
       if window.is_active() == True:
            return window


def toggle_decorations(w):
    if w.get_decorations().decorations == 0:
        w.set_decorations(Gdk.WMDecoration.ALL)
    else:
        w.set_decorations(0)


screen = Wnck.Screen.get_default()
screen.force_update()
display = GdkX11.X11Display.get_default()
window = active_window(screen)
window_id = window.get_xid()

w = GdkX11.X11Window.foreign_new_for_display(display, window_id)
toggle_decorations(w)


window = None
screen = None
Wnck.shutdown()

Growth in number of packages for widely used languages

Derek Jones from The Shape of Code

These days a language’s ecosystem of add-ons, such as packages, is often more important than the features provided by the language (which usually only vary in their syntactic sugar, and built-in support for some subset of commonly occurring features).

Use of a particular language grows and shrinks, sometimes over very many decades. Estimating the number of users of a language is difficult, but a possible proxy is ecosystem activity in the form of package growth/decline. However, it will take many several decades for the data needed to test how effective this proxy might be.

Where are we today?

The Module Counts website is the home for a project that counts the number of libraries/packages/modules contained in 26 language specific repositories. Daily data, in some cases going back to 2010, is available as a csv :-) The following are the most interesting items I discovered during a fishing expedition.

The csv file contains totals, and some values are missing (which means specifying an ‘ignore missing values’ argument to some functions). Some repos have been experiencing large average daily growth (e.g., 65 for PyPI, and 112 for Maven Central-Java), while others are more subdued (e.g., 0.7 for PERL and 3.9 for R’s CRAN). Apart from a few days, the daily change is positive.

Is the difference in the order of magnitude growth due to number of active users, number of packages that currently exist, a wide/narrow application domain (Python is wide, while R’s is narrow), the ease of getting a package accepted, or something else?

The plots below show how PyPI has been experiencing exponential growth of a kind (the regression model fitted to the daily total has the form e^{1.01days+days^2}, where days is the number of days since 2010-01-01; the red line is the daily diff of this equation), while Ruby has been experiencing a linear decline since late 2014 (all code+data):

Daily change in the number of packages in PyPI and Rubygems.

Will the five-year decline in new submissions to Rubygems continue, and does this point to an eventual demise of Ruby (a few decades from now)? Rubygems has years to go before it reaches PERL’s low growth rate (I think PERL is in terminal decline).

Are there any short term patterns, say at the weekly level? Autocorrelation is a technique for estimating the extent to which today’s value is affected by values from the immediate past (usually one or two measurement periods back, i.e., yesterday or the day before that). The two plots below show the autocorrelation for daily changes, with lag in days:

Autocorrelation of daily changes in PyPI and Maven-Java package counts.

The recurring 7-day ‘peaks’ show the impact of weekends (I assume). Is the larger ”weekend-effect’ for Java, compared to PyPI, due to Java usage including a greater percentage of commercial developers (who tend not to work at the weekend)?

I did not manage to find any seasonal effect, e.g., more submissions during the winter than the summer. But I only checked a few of the languages, and only for a single peak (see code for details).

Another way of tracking package evolution is version numbering. For instance, how often do version numbers change, and which component, e.g., major/minor. There have been a couple of studies looking at particular repos over a few years, but nobody is yet recording broad coverage daily, over the long term 😉

Exercises in Programming Style: the python way

Derek Jones from The Shape of Code

Exercises in Programming Style by Cristina Lopes is an interesting little book.

The books I have previously read on programming style pick a language, and then write various programs in that language using different styles, idioms, or just following quirky rules, e.g., no explicit loops, must use sets, etc. “Algorithms in Snobol 4” by James F. Gimpel is a fascinating read, but something of an acquired taste.

EPS does pick a language, Python, but the bulk of the book is really a series of example programs illustrating a language feature/concept that is central to a particular kind of language, e.g., continuation-passing style, publish-subscribe architecture, and reflection. All the programs implement the same problem: counting the number of occurrences of each word in a text file (Jane Austin’s Pride and Prejudice is used).

The 33 chapters are each about six or seven pages long, and contain a page or two or code. Everything is very succinct, and does a good job of illustrating one main idea.

While the first example does not ring true, things quickly pick up and there are lots of interesting insights to be had. The first example is based on limited storage (1,024 bytes), and just does not make efficient use of the available bits (e.g., upper case letters can be represented using 5-bits, leaving three unused bits or 37% of available storage; a developer limited to 1K would not waste such a large amount of storage).

Solving the same problem in each example removes the overhead of having to learn what is essentially housekeeping material. It also makes it easy to compare the solutions created using different ideas. The downside is that there is not always a good fit between the idea being illustrated and the problem being solved.

There is one major omission. Unstructured programming; back in the day it was just called programming, but then structured programming came along, and want went before was called unstructured. Structured programming allowed a conditional statement to apply to multiple statements, an obviously simple idea once somebody tells you.

When an if-statement can only be followed by a single statement, that statement has to be a goto; an if/else is implemented as (using Fortran, I wrote lots of code like this during my first few years of programming):

      IF (I .EQ. J)
      GOTO 100
      Z=1
      GOTO 200
100   Z=2
200

Based on the EPS code in chapter 3, Monolithic, an unstructured Python example might look like (if Python supported goto):

for line in open(sys.argv[1]):
    start_char = None
    i = 0
    for c in line:
        if start_char != None:
           goto L0100
        if not c.isalnum():
           goto L0300
        # We found the start of a word
        start_char = i
        goto L0300
        L0100:
        if c.isalnum():
           goto L0300
        # We found the end of a word. Process it
        found = False
        word = line[start_char:i].lower()
        # Ignore stop words
        if word in stop_words:
           goto L0280
        pair_index = 0
        # Let's see if it already exists
        for pair in word_freqs:
            if word != pair[0]:
               goto L0210
            pair[1] += 1
            found = True
            goto L0220
            L0210:
            pair_index += 1
        L0220:
        if found:
           goto L0230
        word_freqs.append([word, 1])
        goto L0300
        L0230:
        if len(word_freqs) <= 1:
           goto L0300:
        # We may need to reorder
        for n in reversed(range(pair_index)):
            if word_freqs[pair_index][1] <= word_freqs[n][1]:
               goto L0240
            # swap
            word_freqs[n], word_freqs[pair_index] = word_freqs[pair_index], word_freqs[n]
            pair_index = n
            L0240:
        goto L0300
        L0280:
        # Let's reset
        start_char = None
        L0300:
        i += 1

If you do feel a yearning for the good ol days, a goto package is available, enabling developers to write code such as:

from goto import with_goto

@with_goto
def range(start, stop):
    i = start
    result = []

    label .begin
    if i == stop:
        goto .end

    result.append(i)
    i += 1
    goto .begin

    label .end
    return result

Student projects for 2019/2020

Derek Jones from The Shape of Code

It’s that time of year when students are looking for an interesting idea for a project (it might be a bit late for this year’s students, but I have been mulling over these ideas for a while, and might forget them by next year). A few years ago I listed some suggestions for student projects, as far as I know none got used, so let’s try again…

Checking the correctness of the Python compilers/interpreters. Lots of work has been done checking C compilers (e.g., Csmith), but I cannot find any serious work that has done the same for Python. There are multiple Python implementations, so it would be possible to do differential testing, another possibility is to fuzz test one or more compiler/interpreter and see how many crashes occur (the likely number of remaining fault producing crashes can be estimated from this data).

Talking to the Python people at the Open Source hackathon yesterday, testing of the compiler/interpreter was something they did not spend much time thinking about (yes, they run regression tests, but that seemed to be it).

Finding faults in published papers. There are tools that scan source code for use of suspect constructs, and there are various ways in which the contents of a published paper could be checked.

Possible checks include (apart from grammar checking):

Number extraction. Numbers are some of the most easily checked quantities, and anybody interested in fact checking needs a quick way of extracting numeric values from a document. Sometimes numeric values appear as numeric words, and dates can appear as a mixture of words and numbers. Extracting numeric values, and their possible types (e.g., date, time, miles, kilograms, lines of code). Something way more sophisticated than pattern matching on sequences of digit characters is needed.

spaCy is my tool of choice for this sort of text processing task.

London Python Meetup January 2019 – Async Python and GeoPandas

Andy Balaam from Andy Balaam&#039;s Blog

It was a pleasure to go to the London Python Meetup organised by @python_london. There were plenty of friendly people and interesting conversations.

I gave a talk “Making 100 million requests with Python aiohttp” (slides) explaining the basics of writing async code in Python 3 and how I used that to make a very large number of HTTP requests.

Andy giving the presentation

(Photo by CB Bailey.)

Hopefully it was helpful – there were several good questions, so I am optimistic that people were engaged with it.

After that, there was an excellent talk by Gareth Lloyd called “GeoPandas, the geospatial extension for Pandas” in which he explained how to use the very well-developed geo-spatial data tools available in the Python ecosphere to transform, combine, plot and analyse data which includes location information. I was really impressed with how easy the libraries looked to use, and also with the cool Jupyter notebook Gareth used to explain the ideas using live demos.

London Python Meetups seem like a cool place to meet Pythonistas of all levels of experience in a nice, low-pressure environment!

Meetup link: aiohttp / GeoPandas