Ideas on how lexing will work in Pepper3

Andy Balaam from Andy Balaam's Blog

I am trying to practice documentation-driven development in Pepper3, so every time I start on an area, I will write documentation explaining how it works, and include examples that are automatically verified during the build.

I’ve started work on lexing, since you can’t do much before you do that. Of course, I also need a command line interface before I can verify any of the examples, so I’m working on that too.

Lexing is the process that takes a stream of characters (e.g. from a file) and turns it into a stream of “tokens” that are chunks of code like a variable name, a number or a string. (There is more on lexing in my mini programming language, Cell.)
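
As a rough illustration (a minimal sketch in C++, nothing like Pepper3’s eventual rules, and ignoring strings for brevity), a lexer is essentially a loop that groups characters into chunks:

#include <cctype>
#include <iostream>
#include <string>
#include <vector>

struct Token
{
    std::string type;
    std::string value;
};

std::vector<Token> lex(const std::string &input)
{
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size())
    {
        unsigned char c = input[i];
        if (std::isspace(c)) { ++i; continue; }  // skip whitespace between tokens
        std::size_t start = i;
        if (std::isdigit(c)) {                   // a number
            while (i < input.size() && std::isdigit((unsigned char)input[i])) ++i;
            tokens.push_back({"number", input.substr(start, i - start)});
        } else if (std::isalpha(c)) {            // a variable name
            while (i < input.size() && std::isalnum((unsigned char)input[i])) ++i;
            tokens.push_back({"name", input.substr(start, i - start)});
        } else {                                 // anything else: a one-character symbol
            tokens.push_back({"symbol", std::string(1, input[i])});
            ++i;
        }
    }
    return tokens;
}

int main()
{
    for (const Token &t : lex("x + 42"))
        std::cout << t.type << ": " << t.value << "\n";
    // prints: name: x / symbol: + / number: 42
}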

My thoughts so far about lexing are in lexing.md, and current ideas about command line interface are at command_line.md. All very much subject to change.

Headlines:

  • Ordinary programmers can write their own lexing rules.
  • Operators (functions like “+” that find their arguments on their left and right, instead of between brackets like normal functions) are defined at the lexing phase, so any symbol (e.g. “in”) can be an operator if you want (see the sketch after this list).
  • Anything you might want to do with a Pepper3 program, including running it, compiling it, or packaging it for a distribution system, should be available as a sub-command of the main pepper3 command line.
  • The command is “pepper3”, never “pepper”. If a new, incompatible version comes out, it will be called “pepper4”, and they will be parallel-installable, with no confusion.
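
To illustrate the operator idea above (a hedged sketch in C++, not real Pepper3; all the names are invented): if operator-ness is decided at the lexing phase, the lexer only needs an ordinary, user-extensible table of spellings.

#include <iostream>
#include <set>
#include <string>

// The lexer's operator table: any spelling in this set is treated as an
// operator token, whether it is symbolic ("+") or a word ("in").
std::set<std::string> operator_spellings = {"+", "-", "*", "in"};

std::string classify(const std::string &token)
{
    return operator_spellings.count(token) ? "operator" : "name";
}

int main()
{
    operator_spellings.insert("unless");      // ordinary code extends the lexer
    std::cout << classify("in") << "\n";      // operator
    std::cout << classify("unless") << "\n";  // operator
    std::cout << classify("x") << "\n";       // name
}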

Questions and answers about Pepper3

Andy Balaam from Andy Balaam's Blog

Series: Examples, Questions

My last post Examples of Pepper3 code was a reply to my friend’s email asking what it was all about. They replied with some questions, and I thought the questions and answers might shed some more light:

Questions!

Brilliant ones, thanks.

In general though you’ve said a lot about what Pepper can do without giving design decisions.

Yep, total brain dump.

Remind me again who this language is for :)

It’s a multi-paradigm (generic, functional, OO) language aimed at application programmers who want:

  • “native” performance on their chosen platform (definitely including actual native machine code). This is inspired by C++.
  • easy deployment (preferably a single binary containing everything, with an option to link most dependencies statically), including packaging of installers for major OSes. This is inspired by C++, and the pain of C++.
  • perfect flexibility for creating types – “meta-programming” is just programming. Things you would have done using code generation (e.g. generating a class hierarchy from an XSD) are done by running arbitrary code at compile time. The powerful type system is inspired by Haskell and the book “Modern C++ Design”, and the meta-programming is inspired by Lisp.
  • Simple memory management without GC through ownership. This is inspired by modern C++, and then Rust came along and implemented it before I could, thus proving it works. However, I would remove a lot of the functionality in Rust (lifetimes) to make it much simpler.
  • Strong support for functional programming if you want it. This is inspired by Haskell.
  • The simplest possible core language, with application programmers able to expand it by giving them the same tools as the language designers – e.g. “for” is just a function, so you can make your own. I am hoping I can even make “class” a function. This is inspired by Lisp, and oppositely-inspired by Java.
  • Separation between the idea of Interfaces, which I think I will call “type specifiers” (and will allow arbitrary code execution to determine whether a type satisfies the requirements) and structs/classes, allowing us to make new Interfaces and have old code satisfy them, meaning we can do generic stuff with e.g. ints even if no-one declared that “class Int : public Quaternion” or whatever.
  • Lots of “nudges” towards things that are good: by default things will be functional and immutable – you will have to explicitly say if you want to use more dangerous constructs like side effects and mutable values.
  • No implicit conversions, or really anything happening without you saying so.

Can you assign floats to ints or vice versa?

Yes, but you shouldn’t.

If you’re setting types in code at the start of a file, is this only available in the main file? Are there multiple files per program? Can you have libraries? If so, do these decide the functionality of their types in the library or does this only happen in the main file?

I haven’t totally decided – either by being enforced, or as a matter of style, you will generally do this once at the beginning of the program (and choose on the compiler command line to do it e.g. the debug way or the release way) and it will affect all of your code.

Libraries will be packaged as Pepper3 source code, so choices you make of the type of Int etc. will be reflected through the whole dependency tree. Cool, huh?

This is inspired by Python.

Can you group variables together into structs or similar?

Yes – it will be especially easy to make “value types”, and lots of default methods (e.g. copy and move operations) will be provided that you will be strongly encouraged to use. This is inspired by Elm.

Why are variables immutable by default but mutable with a special syntax? It’s the opposite of C++ const, but why that way around?

This is one of the “nudges” – immutable stuff is much easier to think about, and makes parallel stuff easier, and allows optimisations and so on, so turning it on by default means you have to choose to take the bad path, and are inclined to take the virtuous one. This is inspired by Haskell and Rust.

Why only allow assignments, function calls and operators? I’m sure you have good reasons.

To be as simple as possible, so you only have those things to learn and the rest can be understood by just reading the code. This is inspired by Python.

I wrote more of my (earlier) thoughts in this 4-post series, which is better thought through: Goodness in Programming Languages

Constantly Confusing: C++ const and constexpr pointer behaviour

Samathy from Stories by Samathy on Medium

A quick explanation of how const and constexpr work on pointers in C++

So I was checking that my knowledge was correct while working on a Firefox bug.
I made a quick C++ file with all the examples I know of for how to use const and constexpr on pointers.
As one can see, it’s pretty confusing!

Because there are several places in a statement where you can put ‘const’, it can be complicated to work out which part of the statement the ‘const’ refers to.
Generally, it’s best to read from right to left to work it out. For example:

static const char * const hello;

Would read like:

hello (is a) const pointer (to) const char

But, that takes a bit of practice.

C++’s constexpr brings another new dimension to the problem too!
It behaves like const in the sense that it makes all pointers constant pointers.
But because it occurs at the start of your statement (rather than after the ‘*’) it’s not immediately obvious.

Here’s my list of all the ways you can use const and constexpr on pointers, and how they behave.
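
In outline, it covers declarations like these (a sketch with arbitrary variable names; the comment on each line is the right-to-left reading):

char c = 'x';

char *p1;                    // (mutable) pointer to (mutable) char
const char *p2;              // pointer to const char
char const *p3;              // same as p2: pointer to const char
char *const p4 = &c;         // const pointer to (mutable) char
const char *const p5 = &c;   // const pointer to const char

// constexpr makes the pointer itself constant, even though it appears
// at the start of the statement rather than after the '*':
constexpr char *p6 = nullptr;        // constexpr (const) pointer to char
constexpr const char *p7 = "hello";  // constexpr (const) pointer to const char

int main() {}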

Working with PDF Highlight Annotations Programmatically

Samathy from Stories by Samathy on Medium

PDFs are the format of choice in academia, but extracting the information they contain is annoyingly hard.

I’ve just started working on my degree’s final project. An academic project requires lots of research, which means reading lots of papers.
Papers are normally available in one form only: PDF.

While PDF is a format so ubiquitous nowadays that one can guarantee being able to display it as the writer(s) intended, it’s not a nice format, as I found out as soon as I needed to do something with it.

During the course of my research, I’ve been using PDF’s highlight annotations to highlight parts of a paper that’re particularly interesting.
I wanted to be able to retrieve the highlighted text at a later date so I didn’t have to open the paper again to find the parts I found interesting when I read it the first time.

You’d think that exporting annotations on text would be something that all PDF readers which support annotations (most of them do) would be capable of. I mean, surely it’s easy enough, even if there aren’t that many reasons why you’d want to do it.

Alas, none that I found running on Linux had this feature, so I delved into trying to write something to do what I needed.

I based my project on a tool I found in a StackOverflow answer to a question similar to mine.
The Python code in the answer utilises poppler-qt4 to export annotated text from a PDF. Unfortunately, the code is Python 2, and the Python poppler-qt4 package wouldn't install properly on my system anyway, even after installing the system poppler-qt4 package.
Neither would Python’s poppler-qt5 bindings.

Convinced I could do a better job than a Python 2 script which depended on a package last updated in 2015, I translated the answer into the equivalent in C++.

I started with trying to use poppler-cpp, the C++ bindings for poppler, where one has objects and namespaces, and none of the guff associated with GUI frameworks that I wouldn't need here. However, to my dismay, poppler-cpp doesn't support annotations at all. For whatever reason, annotation support only comes with the bindings to a GUI framework, like glib or Qt.

So instead I used poppler-glib (i.e. glib, from the GNOME project), purely because I use GNOME, so I wouldn’t have to install anything extra.

Now, the PDF format is really odd. Annotations seem to be an after-thought, tacked onto the format later.
Highlighting specifically is weird, because a highlight annotation has no connection to the document’s text.
As such, poppler’s poppler_annot_get_contents(PopplerAnnot *), which should return the annotation’s contents, returns nothing.
Instead, to get the text associated with a highlight annotation, one has to get the coordinates of the highlight annotation (a PopplerRectangle) and then use the function poppler_page_get_text_for_area(PopplerPage*, PopplerRectangle*), which returns the text in a defined area.

What an entirely baffling way to go about implementing highlighting. Attaching it as purely a visual element, rather than actually marking up the text.
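
For reference, the core of the approach looks something like this (a hedged sketch using the poppler-glib calls described above; error handling is omitted, and the y-axis flip reflects my understanding of the differing coordinate conventions, so treat it as an assumption):

// Sketch: extract highlighted text with poppler-glib.
// Build (roughly): g++ extract.cc $(pkg-config --cflags --libs poppler-glib)
#include <poppler.h>

int main(int argc, char **argv)
{
    if (argc < 2)
        return 1;

    // poppler wants a URI rather than a plain file path
    gchar *uri = g_filename_to_uri(argv[1], NULL, NULL);
    PopplerDocument *doc = poppler_document_new_from_file(uri, NULL, NULL);

    for (int i = 0; i < poppler_document_get_n_pages(doc); ++i)
    {
        PopplerPage *page = poppler_document_get_page(doc, i);
        gdouble width, height;
        poppler_page_get_size(page, &width, &height);

        GList *annots = poppler_page_get_annot_mapping(page);
        for (GList *l = annots; l != NULL; l = l->next)
        {
            PopplerAnnotMapping *m = (PopplerAnnotMapping *) l->data;
            if (poppler_annot_get_annot_type(m->annot) != POPPLER_ANNOT_HIGHLIGHT)
                continue;

            // Assumption: the annotation rectangle has its origin at the
            // bottom-left, but poppler_page_get_text_for_area() expects a
            // top-left origin, so flip the y coordinates.
            PopplerRectangle area;
            area.x1 = m->area.x1;
            area.x2 = m->area.x2;
            area.y1 = height - m->area.y2;
            area.y2 = height - m->area.y1;

            gchar *text = poppler_page_get_text_for_area(page, &area);
            if (text != NULL)
                g_print("%s\n", text);
            g_free(text);
        }
        poppler_page_free_annot_mapping(annots);
        g_object_unref(page);
    }
    g_object_unref(doc);
    g_free(uri);
    return 0;
}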

Even more baffling is the fact that although my application works, it only mostly works.
Sometimes I get the full text highlighted, other times it chops off characters, and sometimes it adds things that’re nowhere near the highlighted text at all!
This is a problem I’m yet to solve, and might never solve, because it’s ridiculous and the tool mostly does what I needed anyway.

In conclusion: the PDF format is weird, and I wrote a thing.
If you use it, let me know how it goes!

https://github.com/Samathy/pdfcommentextractor

HTML5 CSS Toolbar + zoomable workspace that is mobile-friendly and adaptive

Andy Balaam from Andy Balaam's Blog

I have been working on a prototype level editor for Rabbit Escape, and I’ve had trouble getting the layout I wanted: a toolbar at the side or top of the screen, and the rest a zoomable workspace.

Something like this is very common in many desktop applications, but not that easy to achieve in a web page, especially because we want to take care that it adapts to different screen sizes and orientations, and, for example, allows zooming the toolbar buttons in case we find ourselves on a device with different resolution from what we were expecting.

In the end I’ve gone with a grid-layout solution and accepted the fact that sometimes on mobile devices when I zoom in my toolbar will disappear off the top/side. When I scroll back to it, it stays around, so using this setup is quite natural. On the desktop, it works how you’d expect, with the toolbar staying on screen at all zoom levels.

Here’s how it looks on a landscape display:

and portrait:

Read the full source code.

As you can see from the code linked above, after much fiddling I managed to achieve this with a relatively small amount of CSS, and no JavaScript. I’m hoping it will behave well in unexpected scenarios, because the code expresses what I want fairly closely.

The important bits of the HTML are simple – a main div, a toolbar containing buttons, and a workspace containing some kind of work:

<div id="main">
    <div id="toolbar">
        <button></button><button></button><button></button><button></button><button></button><button></button><button></button><button></button>
    </div>
    <div id="workspace">
        <div id="work">
        </div>
    </div>
</div>

The key bits of the CSS are:

/* Ensure we take up the full height of the page. */
html, body, #main
{
    height: 100%;
}

@media all and (orientation:landscape)
{
    /* On a wide screen, it's a grid with 2 columns,
       and the toolbar can scroll downwards. */
    #main
    {
        display: grid;
        grid-template-columns: 5em 1fr;
    }
    #toolbar
    {
        overflow-x: hidden;
        overflow-y: auto;
    }
}

@media all and (orientation:portrait)
{
    /* On a tall screen, it's a grid with 2 rows,
       and the toolbar can scroll right. */
    #main
    {
        display: grid;
        grid-template-rows: 5em 1fr;
    }
    #toolbar
    {
        overflow-x: auto;
        overflow-y: hidden;
        white-space: nowrap;
    }
}

That replaces an awful lot of code in my first attempt, so I’m reasonably happy. If anyone has suggestions about how to make “100%” really mean 100% of the real device width and height, let me know. If I do some JavaScript I can make Mobile Firefox fit to the real screen size, but Mobile Chrome (and, I assume, Mobile Safari) lie to me about the screen size when zoomed in.

Running a virtualenv with a custom-built Python

Andy Balaam from Andy Balaam's Blog

For my attempt to improve the asyncio.as_completed Python standard library function I needed to build a local copy of cpython (the Python interpreter).

To test it, I needed the aiohttp module, which is not part of the standard library, so the easiest way to get it was using virtualenv.

Here is the recipe I used to get a virtualenv and install packages using pip with a custom-built Python:

$ ~/code/public/cpython/python -m venv env
$ . env/bin/activate
(env) $ pip install aiohttp
(env) $ python mycode.py

Adding a concurrency limit to Python’s asyncio.as_completed

Andy Balaam from Andy Balaam's Blog

Series: asyncio basics, large numbers in parallel, parallel HTTP requests, adding to stdlib

In the previous post I demonstrated how the limited_as_completed function allows us to run a very large number of tasks using concurrency, while capping the number of concurrent tasks at a sensible level to ensure we don’t exhaust resources like memory or operating system file handles.

I think this could be a useful addition to the Python standard library, so I have been working on a modification to the current asyncio.as_completed function. My work so far is here: limited-as_completed.

I ran similar tests to the ones I ran for the last blog post with this code to validate that the modified standard library version achieves the same goals as before.

I used an identical copy of timed from the previous post and updated versions of the other files because I was using a much newer version of aiohttp along with the custom-built python I was running.

server looked like:

#!/usr/bin/env python3

from aiohttp import web
import asyncio
import random

async def handle(request):
    await asyncio.sleep(random.randint(0, 3))
    return web.Response(text="Hello, World!")

app = web.Application()
app.router.add_get('/{name}', handle)

web.run_app(app)

client-async-sem needed me to add a custom TCPConnector to avoid a new limit on the number of concurrent connections that was added to aiohttp in version 2.0. I also needed to move the ClientSession usage inside a coroutine to avoid a warning:

#!/usr/bin/env python3

from aiohttp import ClientSession, TCPConnector
import asyncio
import sys

limit = 1000

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)

async def run(r):
    with ClientSession(connector=TCPConnector(limit=limit)) as session:
        url = "http://localhost:8080/{}"
        tasks = []
        # create instance of Semaphore
        sem = asyncio.Semaphore(limit)
        for i in range(r):
            # pass Semaphore and session to every GET request
            task = asyncio.ensure_future(
                bound_fetch(sem, url.format(i), session))
            tasks.append(task)
        responses = asyncio.gather(*tasks)
        await responses

loop = asyncio.get_event_loop()
loop.run_until_complete(asyncio.ensure_future(run(int(sys.argv[1]))))

My new code that uses my proposed extension to as_completed looked like:

#!/usr/bin/env python3

from aiohttp import ClientSession, TCPConnector
import asyncio
import sys

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

limit = 1000

async def print_when_done():
    with ClientSession(connector=TCPConnector(limit=limit)) as session:
        tasks = (fetch(url.format(i), session) for i in range(r))
        for res in asyncio.as_completed(tasks, limit=limit):
            await res

r = int(sys.argv[1])
url = "http://localhost:8080/{}"
loop = asyncio.get_event_loop()
loop.run_until_complete(print_when_done())
loop.close()

and with these, we get similar behaviour to the previous post:

$ ./timed ./client-async-sem 10000
Memory usage: 73640KB	Time: 19.18 seconds
$ ./timed ./client-async-stdlib 10000
Memory usage: 49332KB	Time: 18.97 seconds

So the implementation I plan to submit to the Python standard library appears to work well. In fact, I think it is better than the one I presented in the previous post, because it uses on_complete callbacks to notice when futures have completed, which reduces the busy-looping we were doing to check for and yield finished tasks.

The Python issue is bpo-30782 and the pull request is #2424.

Note: at first glance, it looks like the aiohttp.ClientSession’s limit on the number of connections (introduced in version 1.0 and then updated in version 2.0) gives us what we want without any of this extra code, but in fact it only limits the number of connections, not the number of futures we are creating, so it has the same problem of unbounded memory use as the semaphore-based implementation.

Making 100 million requests with Python aiohttp

Andy Balaam from Andy Balaam's Blog

Series: asyncio basics, large numbers in parallel, parallel HTTP requests, adding to stdlib

I’ve been working on how to make a very large number of HTTP requests using Python’s asyncio and aiohttp.

Paweł Miech’s post Making 1 million requests with python-aiohttp taught me how to think about this, and got us a long way, with 1 million requests running in a reasonable time, but I need to go further.

Paweł’s approach limits the number of requests that are in progress, but it uses an unbounded amount of memory to hold the futures that it wants to execute.

We can avoid using unbounded memory by using the limited_as_completed function I outlined in my previous post.

Setup

Server

We have a server program “server”:

(Note it differs from Paweł’s version because I am using an older version of aiohttp which has fewer convenient features.)

#!/usr/bin/env python3.5

from aiohttp import web
import asyncio
import random

async def handle(request):
    await asyncio.sleep(random.randint(0, 3))
    return web.Response(text="Hello, World!")

async def init():
    app = web.Application()
    app.router.add_route('GET', '/{name}', handle)
    return await loop.create_server(
        app.make_handler(), '127.0.0.1', 8080)

loop = asyncio.get_event_loop()
loop.run_until_complete(init())
loop.run_forever()

This just responds “Hello, World!” to every request it receives, but after an artificial delay of 0-3 seconds.

Synchronous client

As a baseline, we have a synchronous client “client-sync”:

#!/usr/bin/env python3.5

import requests
import sys

url = "http://localhost:8080/{}"
for i in range(int(sys.argv[1])):
    requests.get(url.format(i)).text

This waits for each request to complete before making the next one. Like the other clients below, it takes the number of requests to make as a command-line argument.

Async client using semaphores

Copied mostly verbatim from Making 1 million requests with python-aiohttp we have an async client “client-async-sem” that uses a semaphore to restrict the number of requests that are in progress at any time to 1000:

#!/usr/bin/env python3.5

from aiohttp import ClientSession
import asyncio
import sys

limit = 1000

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

async def bound_fetch(sem, url, session):
    # Getter function with semaphore.
    async with sem:
        await fetch(url, session)

async def run(session, r):
    url = "http://localhost:8080/{}"
    tasks = []
    # create instance of Semaphore
    sem = asyncio.Semaphore(limit)
    for i in range(r):
        # pass Semaphore and session to every GET request
        task = asyncio.ensure_future(bound_fetch(sem, url.format(i), session))
        tasks.append(task)
    responses = asyncio.gather(*tasks)
    await responses

loop = asyncio.get_event_loop()
with ClientSession() as session:
    loop.run_until_complete(asyncio.ensure_future(run(session, int(sys.argv[1]))))

Async client using limited_as_completed

The new client I am presenting here uses limited_as_completed from the previous post. This means it can make a generator that provides the futures to wait for as they are needed, instead of making them all at the beginning.

It is called “client-async-as-completed”:

#!/usr/bin/env python3.5

from aiohttp import ClientSession
import asyncio
from itertools import islice
import sys

def limited_as_completed(coros, limit):
    futures = [
        asyncio.ensure_future(c)
        for c in islice(coros, 0, limit)
    ]
    async def first_to_finish():
        while True:
            await asyncio.sleep(0)
            for f in futures:
                if f.done():
                    futures.remove(f)
                    try:
                        newf = next(coros)
                        futures.append(
                            asyncio.ensure_future(newf))
                    except StopIteration as e:
                        pass
                    return f.result()
    while len(futures) > 0:
        yield first_to_finish()

async def fetch(url, session):
    async with session.get(url) as response:
        return await response.read()

limit = 1000

async def print_when_done(tasks):
    for res in limited_as_completed(tasks, limit):
        await res

r = int(sys.argv[1])
url = "http://localhost:8080/{}"
loop = asyncio.get_event_loop()
with ClientSession() as session:
    coros = (fetch(url.format(i), session) for i in range(r))
    loop.run_until_complete(print_when_done(coros))
loop.close()

Again, this limits the number of requests to 1000.

Test setup

Finally, we have a test runner script called “timed”:

#!/usr/bin/env bash

./server &
sleep 1 # Wait for server to start

/usr/bin/time --format "Memory usage: %MKB\tTime: %e seconds" "$@"

# %e Elapsed real (wall clock) time used by the process, in seconds.
# %M Maximum resident set size of the process in Kilobytes.

kill %1

This runs each process, ensuring the server is restarted each time it runs, and prints out how long it took to run and how much memory it used.

Results

When making only 10 requests, the async clients worked faster because they launched all the requests simultaneously and only had to wait for the longest one (3 seconds). The memory usage of all three clients was fine:

$ ./timed ./client-sync 10
Memory usage: 20548KB	Time: 15.16 seconds
$ ./timed ./client-async-sem 10
Memory usage: 24996KB	Time: 3.13 seconds
$ ./timed ./client-async-as-completed 10
Memory usage: 23176KB	Time: 3.13 seconds

When making 100 requests, the synchronous client was very slow, but all three clients worked eventually:

$ ./timed ./client-sync 100
Memory usage: 20528KB	Time: 156.63 seconds
$ ./timed ./client-async-sem 100
Memory usage: 24980KB	Time: 3.21 seconds
$ ./timed ./client-async-as-completed 100
Memory usage: 24904KB	Time: 3.21 seconds

At this point let’s agree that life is too short to wait for the synchronous client.

When making 10000 requests, both async clients worked quite quickly, and both had increased memory usage, but the semaphore-based one used almost twice as much memory as the limited_as_completed version:

$ ./timed ./client-async-sem 10000
Memory usage: 77912KB	Time: 18.10 seconds
$ ./timed ./client-async-as-completed 10000
Memory usage: 46780KB	Time: 17.86 seconds

For 1 million requests, the semaphore-based client took 25 minutes on my (32GB RAM) machine. It only used about 10% of my CPU, and it used a lot of memory (over 3GB):

$ ./timed ./client-async-sem 1000000
Memory usage: 3815076KB	Time: 1544.04 seconds

Note: Paweł’s version only took 9 minutes on his laptop and used all his CPU, so I wonder whether I have made a mistake somewhere, or whether my version of Python (3.5.2) is not as good as a later one.

The limited_as_completed version ran in a similar amount of time but used 100% of my CPU, and used a much smaller amount of memory (162MB):

$ ./timed ./client-async-as-completed 1000000
Memory usage: 162168KB	Time: 1505.75 seconds

Now let’s try 100 million requests. The semaphore-based version lasted 10 hours before it was killed by Linux’s OOM Killer, but it didn’t manage to make any requests in this time, because it creates all its futures before it starts making requests:

$ ./timed ./client-async-sem 100000000
Command terminated by signal 9

I left the limited_as_completed version over the weekend and it managed to succeed eventually:

$ ./timed ./client-async-as-completed 100000000
Memory usage: 294304KB	Time: 150213.15 seconds

So its memory usage was still very bounded, and it managed to do about 665 requests/second over an extended period, which is almost identical to the throughput of the previous cases.

Conclusion

Making a million requests is usually enough, but when we really need to do a lot of work while keeping our memory usage bounded, it looks like an approach like limited_as_completed is a good way to go. I also think it’s slightly easier to understand.

In the next post I describe my attempt to get something like this added to the Python standard library.

Python – printing UTC dates in ISO8601 format with time zone

Andy Balaam from Andy Balaam's Blog

By default, when you make a UTC date from a Unix timestamp in Python and print it in ISO format, it has no time zone:

$ python3
>>> from datetime import datetime
>>> datetime.utcfromtimestamp(1496998804).isoformat()
'2017-06-09T09:00:04'

Whenever you talk about a datetime, I think you should always include a time zone, so I find this problematic.

The solution is to mention the timezone explicitly when you create the datetime:

$ python3
>>> from datetime import datetime, timezone
>>> datetime.fromtimestamp(1496998804, tz=timezone.utc).isoformat()
'2017-06-09T09:00:04+00:00'

Note: including the timezone explicitly works the same way when creating a datetime in other ways:

$ python3
>>> from datetime import datetime, timezone
>>> datetime(2017, 6, 9).isoformat()
'2017-06-09T00:00:00'
>>> datetime(2017, 6, 9, tzinfo=timezone.utc).isoformat()
'2017-06-09T00:00:00+00:00'