Samathy from Stories by Samathy on Medium
PDFs are the format of choice in academia, but extracting the information they contain is annoyingly hard.
I've just started working on my degree's final project. An academic project requires lots of research, which means reading lots of papers.
Papers are normally available in one form only: PDF.
While PDF is so ubiquitous nowadays that one can guarantee it will display as the writer(s) intended, it's not a nice format, as I found out as soon as I needed to do something with it.
During the course of my research, I've been using PDF highlight annotations to mark the parts of a paper that are particularly interesting.
I wanted to be able to retrieve the highlighted text at a later date, so I didn't have to open the paper again to find the parts I found interesting when I read it the first time.
You'd think that exporting annotated text would be something that all PDF readers which support annotations (most of them do) would be capable of. I mean, surely it's easy enough, even if there aren't that many reasons why you'd want to do it.
Alas, none that I found running on Linux had this feature, so I delved into writing something to do what I needed.
I based my project on a tool I found in a StackOverflow answer to a question similar to mine.
The Python code in the answer utilises poppler-qt4 to export annotated text from a PDF. Unfortunately, the code is Python 2, and the Python poppler-qt4 package wouldn't install properly on my system anyway, even after installing the poppler-qt4 package.
Neither would Python's poppler-qt5 bindings.
Convinced I could do a better job than a Python 2 script which depended on a package last updated in 2015, I translated the answer into the equivalent in C++.
I started by trying to use poppler-cpp, the C++ bindings for poppler, where one has objects and namespaces and none of the guff associated with GUI frameworks that I wouldn't need here. However, to my dismay, poppler-cpp doesn't support annotations at all. For whatever reason, annotation support only works with the bindings to a GUI framework, like GLib or Qt.
So instead I used poppler-glib (i.e. GLib from the GNOME project), purely because I use GNOME and so wouldn't have to install anything extra.
Now, the PDF format is really odd. Annotations seem to be an afterthought, tacked on to the format later.
Highlighting specifically is weird, because a highlight annotation has no connection to the document's text.
As such, poppler's poppler_annot_get_contents(PopplerAnnot *), which should return the annotation's contents, returns nothing.
Instead, to get the text associated with a highlight annotation, one has to get the coordinates of the highlight annotation (a PopplerRectangle) and then utilise the function poppler_page_get_text_for_area(PopplerPage*, PopplerRectangle*), which returns the text in a defined area.
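One detail that may be worth checking when handing an annotation's rectangle to the text lookup: PDF annotation rectangles are specified with the origin at the bottom-left of the page, while, as far as I can tell, poppler_page_get_text_for_area works with the origin at the top-left. The sketch below shows the vertical flip this would imply. It uses a plain Rect struct as a hypothetical stand-in for PopplerRectangle so that it compiles without poppler, and the name flip_vertically is made up for illustration. Whether the tool described here performs this step is not stated, so treat it as an assumption rather than a description of the actual code.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical stand-in for PopplerRectangle, which carries the same four fields.
struct Rect {
    double x1, y1, x2, y2;
};

// Convert a rectangle from PDF page space (origin bottom-left, y grows upward)
// to a top-left origin (y grows downward), given the page height in points.
// Assumption: the text lookup expects the top-left convention.
Rect flip_vertically(const Rect& r, double page_height) {
    assert(page_height > 0.0);
    Rect out;
    // Normalise so that x1 <= x2 regardless of how the input was ordered.
    out.x1 = std::min(r.x1, r.x2);
    out.x2 = std::max(r.x1, r.x2);
    // The input's top edge (larger y) becomes the output's smaller y.
    out.y1 = page_height - std::max(r.y1, r.y2);
    out.y2 = page_height - std::min(r.y1, r.y2);
    return out;
}
```

With poppler-glib, the page height would come from poppler_page_get_size, and the flipped values would be copied into a PopplerRectangle before calling poppler_page_get_text_for_area.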
What an entirely baffling way to go about implementing highlighting: attaching it as a purely visual element, rather than actually marking up the text.
Even more baffling is the fact that although my application works, it only mostly works.
Sometimes I get the full highlighted text, other times it chops off characters, and sometimes it adds things that are nowhere near the highlighted text at all!
This is a problem I've yet to solve, and I might never solve it, because it's ridiculous and the tool mostly does what I needed anyway.
In conclusion: the PDF format is weird, and I wrote a thing.
If you use it, let me know how it goes!