In Which I Met a Paper Deadline (Barely)
Last week was pretty harrowing on its own, and then there was this paper deadline I had been putting off for, like, 8 months. Admittedly, I live on the (now almost diminished) returns of planned procrastination, but something about this one was even more repulsive than usual. The task involved scraping, wrangling and processing lots of data from a PDF document, because the data didn't seem to exist in a structured digital format anywhere. I despise menial tasks.
Surviving in proprietary-software-reliant academia can be trying for people of my… disposition. I study Social Science, so the situation shouldn't be dire given that the best tools of the trade are FOSS anyway, but faculties can, and regularly do, make life painful. Anyway, in this grave hour my trysts with FOSS came through, hence this tribute.
The task involved exploring historical data on sectoral-level budget allocations, which was only accessible through accursed PDFs (short of physically going over to the relevant institutions, I suppose), because nearly everything is like this here. Past experiences had already consolidated PDFs of scanned images into my list of top-five most hated things. I know that tools like Tesseract exist, but this time I was eminently relieved to find my document OCRed all the same. I fed it to the pdftotext program from the poppler-utils package. The resulting blob of text kept the layout of the PDF intact, so it was a good start, thankfully.
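For what it's worth, that extraction step is a one-liner. The filename below is hypothetical; the -layout flag is what preserves the visual layout:

```shell
# -layout keeps the PDF's visual layout, so table columns stay
# aligned in the text dump (budget.pdf is a made-up filename).
pdftotext -layout budget.pdf budget.txt
```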
Then I brought out the text editor. I used (neo)vim for navigation, selection and piping to external programs. I should mention that I have discovered a couple of really cool evolutions of Vim: kakoune and vis. Both combine the power of Vim with a way to do multiple selections (à la Sublime Text, except on steroids). Not that this can't be done in Emacs; multiple-cursors, iedit or visual-regexp-steroids are pretty cool for starters. Cool as they are, though, shenanigans inside a text editor alone weren't going to scale.
The Grammar of Chaos
Someone wise once said that many problems can, and justifiably should, be turned into a compiler problem. Broadly, not just because we understand compilers well, but because language is a great abstraction for dealing with complexity. Now, I love Perl 6, and it brings built-in Grammars to the table. Regexes are not only 'first-class' language citizens in it; Grammars sit a notch higher on the power ladder than your garden-variety regular expression. The AWK/Perl family also integrates phenomenally with the shell. I wrote a very loose "parser" in it that combed through the data and spat it out in structured form for further operation.
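To give a flavour of the idea (this is not the actual parser; the grammar, rule names and the toy 'sector: amount' input format are all invented for illustration), a minimal Perl 6 grammar looks like this:

```raku
# Minimal sketch, not the real parser: rule names and the input
# format are made up for illustration.
grammar Allocation {
    token TOP    { <sector> ':' \s* <amount> }
    token sector { \w+ }
    token amount { \d+ }
}

my $m = Allocation.parse('Health: 4200');
say $m<sector>;   # named captures are available on the Match object
say $m<amount>;
```

The nice part is that named rules compose: once each line shape has a token, TOP just strings them together, which scales far better than one heroic regex.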
Okay, said script was… a bit of a monstrosity. And the fact that revisiting it a few days later didn't make me want to gouge my eyes out somehow felt like very genuine and high praise to offer it (which probably says something about regexes in general; they are insanely useful, though, and you can only pry them from my cold dead hands). However, I am very interested in alternative (declarative) approaches to text processing (e.g. TXR looks cool and bundles a Lisp too!).
As for storage? Why use a spreadsheet or a DBMS when the filesystem is perfectly adequate? Directories captured the hierarchical nature of the data, which resided in the leaf nodes. Just like that, I could browse it with a normal file browser (I became partial to vifm), batch-operate on it both selectively (with the venerable find) and in parallel (with parallel, duh), and manipulate the data using tools like miller, all from the command line. Of course, not until the first few catastrophes was the importance of a way to roll back realised. But the solution was as simple as turning the whole thing into a git repo.
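As a sketch of the layout (all directory and file names here are invented stand-ins for the real hierarchy):

```shell
# Hypothetical layout: one directory per sector and year,
# with the actual numbers living in leaf files.
mkdir -p budget/health/2019 budget/education/2019
echo 4200 > budget/health/2019/allocation
echo 3100 > budget/education/2019/allocation

# Selective batch operations are just a find away
find budget -name allocation | wc -l

# And rollback comes almost for free once it is a git repo
git -C budget init -q
git -C budget add -A
```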
Sorting Through Faces
One can't explore data without visualising it. Since I wasn't vendor locked-in, so to speak, to any specific platform, I had the freedom to treat this as just another independent problem in the pipeline. I decided to use gnuplot here; the fact that gnuplot has a DSL sealed the deal because, again, I believe working at the language level is more flexible upfront than at the API (library) level. Case in point: I only knew the primitives of the gnuplot DSL when my needs grew to requiring a looping construct. It turns out the DSL does have internal iteration support, but why bother with it when you can use the loop of your host language (Bash here)? So I spat out the 'unrolled' directives, and though the generated script was quite the gigantic mass, gnuplot obliged with a thousand high-quality plots almost instantly. To be honest, the quality tuning was far from smooth sailing, with the desired information scattered across arcane parts of the web. But overall I can attest to its capability.
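The unrolling itself is trivial. A minimal sketch (data files and plot directives invented, and far simpler than anything actually tuned):

```shell
# Generate sample data files (stand-ins for the real ones)
mkdir -p data
printf '1 2\n2 4\n' > data/health.dat
printf '1 3\n2 5\n' > data/education.dat

# Unroll one 'set output'/'plot' pair per file from the host shell,
# instead of using gnuplot's own iteration
{
  echo 'set terminal pngcairo size 800,600'
  for f in data/*.dat; do
    printf 'set output "%s.png"\nplot "%s" using 1:2 with lines\n' \
      "${f%.dat}" "$f"
  done
} > plots.gp
# gnuplot plots.gp    # then hand the unrolled script to gnuplot
```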
As for selectively browsing them, again the command line to the rescue. One example: the directory of plots mirrored the structure of the directory of data, except that I had several plot dirs, each representing a different way the data was normalised. What to do when I want to view all the normalised versions of a plot together? Pipe the output of find through sort? Almost, except sort works on the whole path from its beginning, while what I want is to sort only on the basename. Well, all right:
find dir1 dir2 -name '*.png' | perl6 -e 'lines.sort(*.IO.basename)>>.say' | feh -Fd -f -
And as you can see, feh is handy here, precisely because it adheres to good CLI design!
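Incidentally, if Perl 6 isn't at hand, the same basename sort can be faked with a decorate-sort-undecorate pipeline in plain POSIX tools (directory and file names below are invented):

```shell
# Decorate each path with its basename, sort on that, strip it again.
mkdir -p dir1 dir2
touch dir1/b.png dir2/a.png
find dir1 dir2 -name '*.png' \
  | awk -F/ '{ print $NF "\t" $0 }' \
  | sort \
  | cut -f2-
# → dir2/a.png
#   dir1/b.png
```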
Actually Writing the Thing
That was all well, except that after everything I still had to start writing the actual thing. The toils of this part couldn't be automated away, though writing in Emacs helped. On the technical side, I liked that I achieved some measure of polish thanks to LaTeX. I thought about starting with AUCTeX, but seeing that I am not very proficient in LaTeX and wasn't likely to need advanced features anyway, I decided to write in Org-mode. It got stretched a bit; I had to modify export configs and inline LaTeX commands liberally, but the productivity and organisational benefits of Org-mode are just worth it. One of the things I love most about Emacs is the whole hooks mechanism: I just had to add the export function to after-save-hook. One can obviously go all the way in Emacs (there is even pdf-tools, which is great if you like to annotate), but I used some external tools here. For example, I had the watcher program entr running, which automatically took care of compiling to PDF on change (and notified me if that went wrong), and Zathura knows to pick up the changed PDF automatically.
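The watcher setup was roughly this shape; the file name is hypothetical, latexmk and notify-send are assumptions on my exact toolchain, and entr's -s flag runs the command through the shell:

```shell
# Rebuild the PDF whenever the exported .tex changes, and shout on
# failure; Zathura then picks up the fresh PDF on its own.
ls paper.tex | entr -s 'latexmk -pdf paper.tex || notify-send "build failed"'
```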
Ironically, I am not going to talk about the paper itself, because that doesn't matter here. The ensemble stars of this show were the tools, whose very existence owes to the philosophy of FOSS. May it live long.