Analysing Writing Habit in Org-mode: With Elisp and R (Tidyverse)

December 30, 2021

Categories: Misc Tags: R elisp

I got into Emacs in late 2015, and I immediately took to writing more. A neat personal knowledge management system grew around it, and I also became more serious about keeping a diary. It isn't a habit I could make time for every day, but by and large it is the habit I have managed to stick to with the most consistency.

With the end of 2021, I now have six years' worth of journal entries to reflect on. Before org-mode I still used to take notes, but there was not much structure or consistency to them. And then I managed to lose those files in an HDD crash anyway.

The intent of this post is not to look for insights buried in data. Firstly, I want to showcase the capability of Org-mode as a structured format: it allows programmatic access to the parse tree of the document, which can then be traversed and manipulated. In theory, this should make doing analysis on it very easy. Secondly, I want to do as much literate programming as possible using org-babel (which is like a Jupyter notebook, except it doesn't suck). I want this whole post to be executable code. Basically I want to analyse org-mode in org-mode.

Word Count

Let's start with an easy one. Something that could be accomplished in one line with wc from coreutils.

Now, my diary structure is simple. There is one file per year, aptly named after the year:

(thread-last
  (file-expand-wildcards "~/.notes/20*.org")
  (seq-map #'file-name-nondirectory))
("2016.org" "2017.org" "2018.org" "2019.org" "2020.org" "2021.org")

Each file has top-level headings, one per month. Each month has children denoting each day, and that's it. Now the org-mode parser can turn the document into an AST, and the bundled org-element.el exposes an API with which the content can be extracted or modified. However, the API is somewhat low-level. There are community libraries that build on it to provide a much nicer and, most importantly, declarative interface to do the same. I am partial to a great library called org-ql.

(require 'org-ql)

(defun calculate-words-by-month (year)
  (let ((diary (find-file-noselect (format "~/.notes/%s.org" year))))
    (prog1
        (org-ql-select (list diary) '(level 1)
          :action
          (lambda ()
            (let* ((element (org-element-at-point))
                   (title (org-element-property :raw-value element))
                   (beg (org-element-property :begin element))
                   (end (org-element-property :end element)))
              (cons title (count-words beg end)))))
      (kill-buffer diary))))

(calculate-words-by-month 2016)
(("January" . 10683) ("February" . 15417) ("March" . 12826) ("April" . 10882) ("May" . 13269) ("June" . 10039) ("July" . 10058) ("August" . 7958) ("September" . 9591) ("October" . 10245) ("November" . 6338) ("December" . 5598))

Which is neat. I just asked for all level 1 entries (months), and in the action callback that's invoked on each node returned by the query, I call count-words with the appropriate range and return each labelled datapoint in a cons cell.

But if I wanted to format the output of this org-babel cell directly as an org-mode table for presentation purposes, I would have to transform this alist of cons cells into a list of lists.
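For comparison, here is a rough sketch of the same count done against the lower-level org-element API directly, without org-ql. The function name my/words-per-top-heading is made up for illustration, and it takes the Org text as a string rather than a file so that it can be tried in isolation. Like the version above, it counts the heading line itself as part of the month.

```elisp
(require 'org)
(require 'org-element)

(defun my/words-per-top-heading (org-text)
  "Return an alist of (TITLE . WORD-COUNT) for each level-1 heading in ORG-TEXT.
Hypothetical helper: ORG-TEXT is a string holding Org markup."
  (with-temp-buffer
    (insert org-text)
    (org-mode)
    ;; Parse only down to headlines; we do not need paragraphs or objects.
    (org-element-map (org-element-parse-buffer 'headline) 'headline
      (lambda (headline)
        ;; Returning nil for deeper headings drops them from the result.
        (when (= 1 (org-element-property :level headline))
          (cons (org-element-property :raw-value headline)
                (count-words (org-element-property :begin headline)
                             (org-element-property :end headline))))))))

;; Usage on a tiny inline document:
(my/words-per-top-heading "* January\nhello world foo\n* February\nbar baz\n")
```

The filtering, mapping and buffer handling that org-ql hides behind '(level 1) and :action are all spelled out by hand here, which is the main reason I prefer the declarative version.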

(nconc
 (list '("Month" "Word Count") 'hline)
 (cl-loop for (head . tail) in (calculate-words-by-month 2016)
          collect (list head tail)))
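| Month     | Word Count |
|-----------+------------|
| January   |      10683 |
| February  |      15417 |
| March     |      12826 |
| April     |      10882 |
| May       |      13269 |
| June      |      10039 |
| July      |      10058 |
| August    |       7958 |
| September |       9591 |
| October   |      10245 |
| November  |       6338 |
| December  |       5598 |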

Time to get all data dumped as CSV. This is kind of grunt work.

(let ((data (seq-map #'calculate-words-by-month (range-uncompress '((2016 . 2021))))))
  (with-temp-file "/tmp/analysis.csv"
    (erase-buffer)
    
    ;; In the first row we write column names
    ;; Which is the year at first
    (insert "Year,")
    ;; And then the names of months
    (insert (string-join (mapcar #'car (car data)) ","))
    (newline)

    ;; Then in each row we write the year
    ;; And then the broken down monthly values for the year
    (cl-loop for year from 2016 upto 2021
             for row in data
             do (progn
                  (insert (format "%d," year))
                  (thread-first
                    (cl-loop for (_month . wc) in row
                             collect (number-to-string wc))
                    (string-join ",")
                    (insert))
                  (newline)))))

(with-temp-buffer
  (insert-file-contents-literally "/tmp/analysis.csv")
  (buffer-string))
Year,January,February,March,April,May,June,July,August,September,October,November,December
2016,10683,15417,12826,10882,13269,10039,10058,7958,9591,10245,6338,5598
2017,10317,7425,7158,7897,6378,6886,6620,6868,6330,7056,5068,6306
2018,8932,5876,6829,8128,4823,3291,6998,8059,6550,5412,3355,3424
2019,3063,3558,4516,5000,3803,3373,2510,2771,3206,2455,1872,2019
2020,2762,2183,1696,1753,4169,2656,2940,5189,4423,4063,3690,2907
2021,2183,1548,2074,3074,2744,2833,3215,3954,4100,2897,3154,4000

I wasn't really sure how the data ought to be oriented. But I also didn't care to weigh the pros and cons of it, one way or another. Because I can just enable csv-mode in the resultant buffer, and if I want to transpose, I only need to invoke csv-transpose (C-c C-t). And voilà,

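| Month     |  2016 |  2017 | 2018 | 2019 | 2020 | 2021 |
|-----------+-------+-------+------+------+------+------|
| January   | 10683 | 10317 | 8932 | 3063 | 2762 | 2183 |
| February  | 15417 |  7425 | 5876 | 3558 | 2183 | 1548 |
| March     | 12826 |  7158 | 6829 | 4516 | 1696 | 2074 |
| April     | 10882 |  7897 | 8128 | 5000 | 1753 | 3074 |
| May       | 13269 |  6378 | 4823 | 3803 | 4169 | 2744 |
| June      | 10039 |  6886 | 3291 | 3373 | 2656 | 2833 |
| July      | 10058 |  6620 | 6998 | 2510 | 2940 | 3215 |
| August    |  7958 |  6868 | 8059 | 2771 | 5189 | 3954 |
| September |  9591 |  6330 | 6550 | 3206 | 4423 | 4100 |
| October   | 10245 |  7056 | 5412 | 2455 | 4063 | 2897 |
| November  |  6338 |  5068 | 3355 | 1872 | 3690 | 3154 |
| December  |  5598 |  6306 | 3424 | 2019 | 2907 | 4000 |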

Note also that after transposition the first column name should be changed from "Year" to "Month" to reflect the change in shape. This is like the hidden cut in those so-called single-take films :P

Anyway, it also doesn't really matter, because here we are passing the mantle to R, for which this stuff is bread and butter. But okay, let's keep the transposed version by simply saving the file. Now, typically I work with R using the phenomenal Emacs Speaks Statistics package. But I want to keep using org-babel, and it turns out I can, because org-babel has a session feature (implemented with ESS under the hood, even). I just need to enable R first,
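If you would rather not leave Elisp for this step, the transposition and the header rename can be sketched in a few lines. This is only an illustration: my/csv-transpose is a made-up helper (not part of csv-mode), and it assumes every row has the same number of fields and that no field contains a quoted comma, which holds for the file above.

```elisp
(require 'cl-lib)
(require 'subr-x)

(defun my/csv-transpose (csv &optional corner)
  "Return CSV, a string of comma-separated rows, with rows and columns swapped.
Hypothetical helper: assumes no quoted fields and rows of equal length.
When CORNER is non-nil it replaces the top-left cell, e.g. \"Month\"."
  (let* ((rows (mapcar (lambda (line) (split-string line ","))
                       (split-string csv "\n" t)))
         ;; Zipping all the rows together yields the list of columns.
         (columns (when rows (apply #'cl-mapcar #'list rows))))
    (when (and corner columns)
      (setcar (car columns) corner))
    (mapconcat (lambda (row) (string-join row ",")) columns "\n")))

;; Usage on a cut-down version of the file:
(my/csv-transpose "Year,January,February\n2016,10683,15417\n2017,10317,7425\n"
                  "Month")
```

The zip trick (apply #'cl-mapcar #'list rows) is the whole transposition; everything else is splitting and joining strings.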

(org-babel-do-load-languages
 'org-babel-load-languages
 '((R . t)))

Optionally, on the R side one might want to install the "ascii" package as well, as it can format tables in a way that's directly compatible with org-mode.

install.packages("ascii")

From now on, I am going to be (conceptually) passing every R source block a :session argument. For example,

#+begin_src R :session analysis :results output wrap
library(ascii)
options(asciiType="org")

x <- matrix(1:4, 2, 2)
ascii(x)
#+end_src

But since that can become tedious, in reality I would be adding this to the top of my org file, and have it apply to all R source blocks automatically.

#+PROPERTY: header-args:R  :session analysis

And now the great thing is, the above snippet automatically produces an org-mode table (you have to pass :results output wrap), which in turn gets turned into HTML by my blogging system (Hugo). And after sprinkling some CSS on it, what appears to you is:

| 1.00 | 3.00 |
| 2.00 | 4.00 |

Cool! Time to bring out the big guns. Let's also read the input, and prepare some helper variables.

library(tidyverse)
library(ggplot2)
library(hrbrthemes)

input <- read_csv(
    "/tmp/analysis.csv",
    col_types = cols(.default = "i", Month = "c"),
)

years <- as.character(2016:2021)
months <- input$Month

ascii(input)
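|    | Month     |     2016 |     2017 |    2018 |    2019 |    2020 |    2021 |
|----+-----------+----------+----------+---------+---------+---------+---------|
|  1 | January   | 10683.00 | 10317.00 | 8932.00 | 3063.00 | 2762.00 | 2183.00 |
|  2 | February  | 15417.00 |  7425.00 | 5876.00 | 3558.00 | 2183.00 | 1548.00 |
|  3 | March     | 12826.00 |  7158.00 | 6829.00 | 4516.00 | 1696.00 | 2074.00 |
|  4 | April     | 10882.00 |  7897.00 | 8128.00 | 5000.00 | 1753.00 | 3074.00 |
|  5 | May       | 13269.00 |  6378.00 | 4823.00 | 3803.00 | 4169.00 | 2744.00 |
|  6 | June      | 10039.00 |  6886.00 | 3291.00 | 3373.00 | 2656.00 | 2833.00 |
|  7 | July      | 10058.00 |  6620.00 | 6998.00 | 2510.00 | 2940.00 | 3215.00 |
|  8 | August    |  7958.00 |  6868.00 | 8059.00 | 2771.00 | 5189.00 | 3954.00 |
|  9 | September |  9591.00 |  6330.00 | 6550.00 | 3206.00 | 4423.00 | 4100.00 |
| 10 | October   | 10245.00 |  7056.00 | 5412.00 | 2455.00 | 4063.00 | 2897.00 |
| 11 | November  |  6338.00 |  5068.00 | 3355.00 | 1872.00 | 3690.00 | 3154.00 |
| 12 | December  |  5598.00 |  6306.00 | 3424.00 | 2019.00 | 2907.00 | 4000.00 |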

So it reads the CSV file fine.

Now I want to produce some plots (using ggplot2 of course). But how would they appear here? Do I write functions to output them to some image file? Org-mode can do that, but then I would still need to manually link said file here, which would not be in the spirit of a "notebook".

I just need to set some path for both R to write to, and my blogging system to automatically pick it up and link the image here. I think the only way for both to work is if I set current dir in R to wherever this post is, and then use relative path for the static file destination.

Setting path in R, by interpolating an elisp variable, could be accomplished with:

#+begin_src R :session analysis :var dir=(directory-file-name default-directory)
setwd(dir)
#+end_src

We can make two different languages talk like this! I guess I didn't even need to generate intermediate CSV?!

And now, let's define a function that can give us the relative path for the static image folder. We don't even need to do this mentally. We can figure out a well-defined common ancestor, which can be the project root (Emacs nowadays has the built-in project.el for all things project related, so there is no need to reach for third party packages like projectile). Then we can derive the static asset folder from there (creating it if it doesn't exist). And finally, we can ask Emacs to calculate the relative path from this post's directory to the image folder!

(defun get-image-asset-dir ()
  (let* ((root (project-root (project-current)))
         (image-dir (file-name-concat root "static/images/"))
         (post-dir default-directory))
    (mkdir image-dir t)
    (file-relative-name image-dir post-dir)))

(get-image-asset-dir)

Which gives us:

../../../static/images/

That's what we want! Now let's ask org-babel to save the image file in that directory by passing :output-dir (get-image-asset-dir) to the block containing the ggplot2 code (this is also something we can make global by defining it at the top of the file, like before).

wc_by_year <- tibble(
    year = fct_rev(factor(years, levels = years)),
    wc = input %>% select_at(years) %>% colSums()
)

ggplot(wc_by_year, aes(year, wc)) +
    geom_col() +
    coord_flip() +
    labs(x = "", y = "", title = "Word Count by Year") +
    theme_ipsum_rc(grid="X")

And boom:

../../../static/images/wc_by_year.png

On a tangential note, previewing the plots inline in org-mode while also having the links work in the published site generated by Hugo posed an interesting problem. In both cases the relative path needs to be backed by actual content: a file in the filesystem in the first case, and a static asset for the webserver in the second. However, by design Hugo takes everything in staticDir (which is static/ by default) and splatters its content across the root of the published directory. This means the asset URIs lose the top-level static/ part, so the links we generate for the plots, which let us preview images inline in org-mode, no longer work in the published document.

Could the converter/exporter be somewhat clever here and fix those links? I am not sure that's within its purview. However, there is a way to stop Hugo from stripping the top-level static directory in the first place, which solves the issue at its source:

[[module.mounts]]
  source="static"
  target="static/static"

Anyway, this confirms what I already know. The volume was only really good the first year or so. By 2021 it has shrunk down to roughly 29% of the 2016 volume (35,776 words against 122,904) :(

The UNIX hacker is smirking, thinking how I have only now achieved parity with what wc could do in one line. But twist the question a little bit: how about word count grouped by month instead of year? Now wc has no answer to that, but we have the foundation to quickly leave it in the dust.
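To put a number on that drop without leaving Emacs, the 2016 and 2021 rows of the CSV above can simply be summed and compared. A small sketch; my/percent-of is a made-up helper, and the two lists are copied from the CSV by hand.

```elisp
(defun my/percent-of (part whole)
  "Return the sum of list PART as a percentage of the sum of list WHOLE.
Hypothetical helper for illustration."
  (/ (* 100.0 (apply #'+ part)) (apply #'+ whole)))

(let ((y2016 '(10683 15417 12826 10882 13269 10039 10058 7958 9591 10245 6338 5598))
      (y2021 '(2183 1548 2074 3074 2744 2833 3215 3954 4100 2897 3154 4000)))
  ;; Yearly totals, then 2021 as a share of 2016.
  (list (apply #'+ y2016)
        (apply #'+ y2021)
        (format "%.1f%%" (my/percent-of y2021 y2016))))
;; => (122904 35776 "29.1%")
```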

wc_by_month <- tibble(
    month = fct_rev(factor(months, levels = months)),
    wc = input %>% select_at(years) %>% rowSums()
)

ggplot(wc_by_month, aes(x = month, y = wc)) +
    geom_col() +
    scale_y_comma(limits = c(0, 42000)) +
    coord_flip() +
    labs(x = "", y = "", title = "Word Count by Month") +
    theme_ipsum_rc(grid="X")

../../../static/images/wc_by_month.png

Hmm, that's something I didn't know. Apparently for some reason I really don't like to write in November/December? I think I should normalise the numbers within each year though, because the circumstances each year were probably very different.

Btw, R has anonymous functions (and since 4.1 even a shorthand for them, as in \(x) x/sum(x))!

wc_by_month_normalised <- tibble(
    month = fct_rev(factor(months, levels = months)),
    wc = input %>% select_at(years) %>% mutate_all(function(x) x/sum(x)) %>% rowMeans()
)

ggplot(wc_by_month_normalised, aes(x = month, y = wc)) +
    geom_col() +
    coord_flip() +
    labs(x = "", y = "", title = "Word Count by Month (normalised)") +
    theme_ipsum_rc(grid="X")

../../../static/images/wc_by_month_normalised.png

It's pretty much the same apart from some subtle changes. Now August pulls ahead of January and April, by virtue of being more consistent against other months in the same year. That tracks I guess. I probably spend more time indoors in August, because I live in the tropics and that's when the monsoon hits, and I really hate the mud and clogged water everywhere.
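In case the R pipeline above reads as too terse: the normalisation turns each month into its share of that year's total, and then averages those shares across years. Here is the same arithmetic as an Elisp sketch, with a made-up function name and toy numbers so the result is easy to check by hand.

```elisp
(require 'cl-lib)

(defun my/mean-monthly-share (years)
  "Return the mean per-month share of the yearly total across YEARS.
YEARS is a list with one list of monthly word counts per year.
Hypothetical helper mirroring x/sum(x) followed by rowMeans()."
  (let ((shares (mapcar (lambda (counts)
                          ;; Each month as a fraction of its own year.
                          (let ((total (float (apply #'+ counts))))
                            (mapcar (lambda (n) (/ n total)) counts)))
                        years))
        (n (length years)))
    ;; Average the shares month by month across all years.
    (apply #'cl-mapcar (lambda (&rest xs) (/ (apply #'+ xs) n)) shares)))

;; Two toy "years" of two "months" each:
(my/mean-monthly-share '((1 3) (2 2)))
;; => (0.375 0.625)
```

A year in which I wrote a lot overall no longer dominates the picture; every year gets an equal vote on which months are strong.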

Missing Days

Volume in terms of word count is one thing, but I want to know how consistently I showed up to write over time (without being late or missing days altogether). As I hinted, there is a header some level deep that's the date, and the content therein is basically the entry.

The date is easily parseable as it's the title of the node in dd.mm.yyyy format (yes, not very ISO 8601 of me).

Back to the *scratch* buffer (my drawing board) for some more elisp.

Before I figure out missing days, first I need a list of all days in a year. This works:

(defun diary-all-days-in-year (year)
  (let* ((day 1)
         (time (encode-time `(0 0 0 ,day 1 ,year nil nil nil)))
         (all nil))
    (while (= (nth 5 (decode-time time)) year)
      (setq all (cons (format-time-string "%d.%m.%Y" time) all))
      (setq day (+ 1 day)
            time (encode-time `(0 0 0 ,day 1 ,year nil nil nil))))
    (nreverse all)))

(let ((days (diary-all-days-in-year 2020)))
  (format "Total days in 2020: %d\n" (length days)))
Total days in 2020: 366
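The function above leans on encode-time normalising out-of-range days (day 32 of January becomes 1 February, and so on). An alternative sketch that avoids time values altogether uses date-days-in-month from the built-in time-date library, which as far as I know is available from Emacs 27 onwards. The name my/all-days-in-year is made up for illustration.

```elisp
(require 'cl-lib)
(require 'time-date)

(defun my/all-days-in-year (year)
  "Return every date of YEAR as a list of \"dd.mm.yyyy\" strings.
Hypothetical alternative that assumes `date-days-in-month' exists."
  (cl-loop for month from 1 to 12
           append (cl-loop for day from 1 to (date-days-in-month year month)
                           collect (format "%02d.%02d.%d" day month year))))

;; Leap year versus regular year:
(list (length (my/all-days-in-year 2020))
      (length (my/all-days-in-year 2021)))
;; => (366 365)
```

Since no time zone or DST rule is involved, the result does not depend on where the code is run.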

Now I want to calculate how many days have a corresponding entry in the diary:

(defun diary-all-entries (year)
  (org-ql-select (list (concat "~/.notes/" (int-to-string year) ".org"))
       `(and (heading-regexp ,(int-to-string year)) (level 4))
       :action (lambda () (org-element-property :raw-value (org-element-at-point)))))

(let ((days (diary-all-entries 2020)))
  (format "Total days with an entry in 2020: %d\n" (length days)))
Total days with an entry in 2020: 355

The difference between these two sets is basically the missing days. Let's generalise that across all years:

(nconc
 (list '("Year" "Misses") 'hline)
 (cl-loop for year from 2016 upto 2021
          collect (list year (length
                              (seq-difference
                               (diary-all-days-in-year year)
                               (diary-all-entries year))))))
| Year | Misses |
|------+--------|
| 2016 |      0 |
| 2017 |      1 |
| 2018 |      0 |
| 2019 |      6 |
| 2020 |     11 |
| 2021 |      2 |

Although it works great, it's still slightly misleading. When I do miss a day, for the most part I try my best to go back and plug the gap later. Which is great, but I also want to see how often I was late.

That would be impossible to do, if it wasn't for the fact that I have a helper function that generates the boilerplate before I write. It takes care of creating the headline and, most importantly, the current timestamp.

A timestamp in org-mode looks like [2022-01-01 Sat 16:47]. I can look for it with a regex, and then parse it (for which I will use the awesome ts.el library).

(require 'ts)

(defun diary-late-entries (year)
  (let ((ts-pattern (rx line-start (any "<" "[") "20" (+? nonl) (any ">" "]") line-end)))
    (flatten-tree
     (org-ql-select (list (concat "~/.notes/" (int-to-string year) ".org"))
       `(and (heading-regexp ,(int-to-string year)) (level 4))
       :action (lambda ()
                 (let* ((element (org-element-at-point))
                        (date (org-element-property :raw-value element))
                        (beg (org-element-property :contents-begin element))
                        (end (org-element-property :contents-end element))
                        (timestamp) (timedate))
                   (goto-char beg)
                   (when (search-forward-regexp ts-pattern end t 1)
                     (setq timestamp (ts-unix (ts-parse-org (match-string-no-properties 0)))
                           timedate (format-time-string "%d.%m.%Y" timestamp))
                     (unless (string= date timedate)
                       (setq timedate (format-time-string "%d.%m.%Y" timestamp (* -2 3600)))
                       (unless (string= date timedate)
                         date)))))))))

(let ((days (diary-late-entries 2020)))
  (format "Total late days in 2020: %d\n" (length days)))

It got a little complicated because I had to account for my nocturnal nature (sometimes I begin writing way past midnight, but I don't want to count that as "late").
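The grace-period rule on its own, stripped of the org-ql plumbing, can be sketched as a small predicate. This is not the code that produced the numbers in this post. One detail worth knowing when adapting it: the third argument of format-time-string is a time zone, so an integer passed there is read as an offset from UTC rather than from local time. The sketch below therefore subtracts the grace period from the timestamp itself. The name my/entry-late-p and its arguments are made up for illustration.

```elisp
(defun my/entry-late-p (heading-date unix-time &optional grace-hours zone)
  "Return non-nil if an entry headed HEADING-DATE was written late.
HEADING-DATE is a \"dd.mm.yyyy\" string and UNIX-TIME is the moment of
writing in seconds since the epoch.  Writing up to GRACE-HOURS (default 2)
past midnight still counts as on time.  ZONE is passed through to
`format-time-string' (nil means local time, t means UTC).
Hypothetical helper for illustration."
  (let ((grace (* 3600 (or grace-hours 2))))
    (not (or
          ;; Written on the day named in the heading.
          (string= heading-date
                   (format-time-string "%d.%m.%Y" unix-time zone))
          ;; Written shortly after midnight on the following day.
          (string= heading-date
                   (format-time-string "%d.%m.%Y" (- unix-time grace) zone))))))

;; 1577928600 is 2020-01-02 01:30 UTC, 1577934000 is 2020-01-02 03:00 UTC.
(list (my/entry-late-p "01.01.2020" 1577928600 2 t)
      (my/entry-late-p "01.01.2020" 1577934000 2 t))
;; => (nil t)
```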

However the result is still kind of horrifying.

Total late days in 2020: 124

Oops.

I can now generate the report for all years:

(nconc
 (list '("Year" "Late") 'hline)
 (cl-loop for year from 2016 upto 2021
          collect (list year (length (diary-late-entries year)))))
| Year | Late |
|------+------|
| 2016 |    7 |
| 2017 |   52 |
| 2018 |   66 |
| 2019 |   92 |
| 2020 |  124 |
| 2021 |  114 |

And well, that makes for grim reading.

Remarks

Well I already knew in my bones that I am doing worse at writing, so that's no news and that's not what I was trying to highlight.

But it was a good medium for showcasing the capabilities of org-mode because I didn't set out to keep journals with the intention of later being able to do any of this. Org-mode is simply good at being a text based database.

The other thing I want to say is, Elisp is a damn good general purpose language. Just because it's married to a text editor doesn't change that fact. It has almost all the features and basic libraries you have come to expect from any modern language. The only glaring problem is probably the concurrency shortcomings, but hey, it's not like popular languages like Python, Ruby or JavaScript cover themselves in glory there.

In my experience, the detractors of Elisp are mostly CL users. They also love to rain on the parade of pretty much every other Lisp, and they are not above gatekeeping about what makes a Lisp and what doesn't. If you look at GitHub stats from 2021 Q4, by pushed commit count Elisp is at #29, whereas CL is nowhere to be found in the top 50. By new PR count, Elisp is at #31 whereas CL is again nowhere to be found. By new issue count, Elisp is at #22 and you can finally get a whiff of CL at #46. Elisp is the most popular Lisp in the world along with Clojure, which probably explains the salt.