I got into Emacs in late 2015, and I immediately took to writing more. A neat personal knowledge management system grew around it, but I also became more serious about keeping a diary. It hasn't been a habit I could make time for every day, but by and large it is still the habit I've managed to stick to most consistently.
With the end of 2021, I now have six years' worth of journals to reflect on. Before org-mode I used to take notes too, but there wasn't much structure or consistency to them. And then I managed to lose those files in an HDD crash anyway.
I am not really doing this to find some insight buried in the data. I mainly want to showcase Org-mode as a structured format, one that gives you programmatic access to the parse tree of your documents, which you can then traverse and manipulate. This, in theory, should make analysis very easy. Let's see how that assumption fares in practice.
We start with an easy one: something that could be accomplished in one line with wc from coreutils.
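To make that baseline concrete, here is a sketch of the one-liner against some throwaway files (the /tmp/notes-demo directory and its contents are invented for the demo; on my setup it would be the ~/.notes files shown below):

```shell
# Sketch of the coreutils baseline: per-file word counts plus a grand total.
# The demo files below are invented; in my case it would be ~/.notes/20*.org.
mkdir -p /tmp/notes-demo
printf 'hello world\n' > /tmp/notes-demo/2016.org
printf 'one two three\n' > /tmp/notes-demo/2017.org
wc -w /tmp/notes-demo/20*.org
```

With more than one file, wc also prints a "total" line, which is all the aggregation it can offer.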
My diary structure is simple, there is one file per year, aptly named after the year:
(thread-last (file-expand-wildcards "~/.notes/20*.org")
             (seq-map #'file-name-nondirectory))
("2016.org" "2017.org" "2018.org" "2019.org" "2020.org" "2021.org")
Each file has top-level headings denoting the months. Each month then has children that at some point denote a particular day, which we will leave for later. For ease, I am going to use a great library named org-ql that lets you declaratively query for things in your Org files.
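For orientation, a year file is shaped roughly like this. The day heading and timestamp formats come up later in the post; the intermediate levels and the entry text here are invented for illustration:

```org
* January
** ...                  (intermediate levels elided; actual structure may vary)
*** ...
**** 01.01.2016
[2016-01-01 Fri 21:30]
The entry text goes here...
* February
```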
(require 'org-ql)

(defun calculate-words-by-month (year)
  (let ((diary (find-file-noselect (format "~/.notes/%s.org" year))))
    (prog1
        (org-ql-select (list diary) '(level 1)
          :action (lambda ()
                    (let* ((element (org-element-at-point))
                           (title (org-element-property :raw-value element))
                           (beg (org-element-property :begin element))
                           (end (org-element-property :end element)))
                      (cons title (count-words beg end)))))
      (kill-buffer diary))))

(calculate-words-by-month 2016)
(("January" . 10683) ("February" . 15417) ("March" . 12826) ("April" . 10882) ("May" . 13269) ("June" . 10039) ("July" . 10058) ("August" . 7958) ("September" . 9591) ("October" . 10245) ("November" . 6338) ("December" . 5598))
Which is neat. I am using cons cells because soon I want to export the data for all years into CSV, and that makes it easier. But if I wanted to format the value of this org-babel cell directly into an org-mode table, I would only need to return a plain list.
(nconc (list '("Month" "Word Count") 'hline)
       (cl-loop for (head . tail) in (calculate-words-by-month 2016)
                collect (list head tail)))
Time to get all data dumped as CSV, as promised. This is kind of grunt work.
(require 'range) ; for `range-uncompress'

(let ((data (seq-map #'calculate-words-by-month
                     (range-uncompress '((2016 . 2021))))))
  (with-temp-file "/tmp/analysis.csv"
    ;; In the first row we write the column names:
    ;; the year first,
    (insert "Year,")
    ;; and then the names of the months.
    (insert (string-join (mapcar #'car (car data)) ","))
    (newline)
    ;; Then in each row we write the year,
    ;; followed by that year's monthly values.
    (cl-loop for year from 2016 upto 2021
             for row in data
             do (progn
                  (insert (format "%d," year))
                  (thread-first (cl-loop for (_month . wc) in row
                                         collect (number-to-string wc))
                                (string-join ",")
                                (insert))
                  (newline)))))

(with-temp-buffer
  (insert-file-contents-literally "/tmp/analysis.csv")
  (buffer-string))
Year,January,February,March,April,May,June,July,August,September,October,November,December
2016,10683,15417,12826,10882,13269,10039,10058,7958,9591,10245,6338,5598
2017,10317,7425,7158,7897,6378,6886,6620,6868,6330,7056,5068,6306
2018,8932,5876,6829,8128,4823,3291,6998,8059,6550,5412,3355,3424
2019,3063,3558,4516,5000,3803,3373,2510,2771,3206,2455,1872,2019
2020,2762,2183,1696,1753,4169,2656,2940,5189,4423,4063,3690,2907
2021,2183,1548,2074,3074,2744,2833,3215,3954,4100,2897,3154,4000
I wasn't really sure how the data ought to be oriented, but I also didn't care to weigh the pros and cons either way, because I can just enable csv-mode in the resulting buffer, and if I want to transpose, I only need to invoke csv-transpose (C-c C-t). And voilà,
Anyway, it also doesn't really matter, because we are gonna make way for R to arrive and pick up from here; this stuff is its bread and butter. But okay, we keep the transposed version by simply saving the file. Now, typically I work with R using the phenomenal Emacs Speaks Statistics (ESS) package. But I want to keep using org-babel, and of course it has a session feature, which even uses ESS under the hood. We first need to load it up,
(org-babel-do-load-languages 'org-babel-load-languages '((R . t)))
On the R side, one might want to install the "ascii" package as well, since with it one can get tables formatted in a way that's directly compatible with org-mode.
From now on, we are going to be (conceptually) passing every R source block a :session argument. For example,
#+begin_src R :session analysis :results output wrap
library(ascii)
options(asciiType = "org")
x <- matrix(1:4, 2, 2)
ascii(x)
#+end_src
But since that can become tedious, in reality I would be adding this to the top of my org file, and have it apply to all R source blocks automatically.
#+PROPERTY: header-args:R :session analysis
And now the great thing is, the above snippet automatically produces an org-mode table (you have to pass :results output wrap), which in turn gets turned into HTML by my blogging system (Hugo). And after sprinkling some CSS on it, what appears to you is:
Cool! Time to bring out the big guns. Let's read the input and prepare some helper variables.
library(tidyverse)
library(ggplot2)
library(hrbrthemes)

input <- read_csv(
  "/tmp/analysis.csv",
  col_types = cols(.default = "i", Month = "c")
)

years <- as.character(2016:2021)
months <- input$Month

ascii(input)
So it reads the CSV file fine.
Now we want to produce some plots (using ggplot2 of course). But how would they appear here? Do I write functions to output them to some image file, and manually link said file here? Nope, org-mode can do it all for us!
We just need to set a path that both R can write to and the blogging system can automatically pick up so the image gets linked here. I think the only way for both to work is if I set the current directory in R to wherever this post is, and then use a relative path to the static file destination.
Setting path in R, by interpolating an elisp variable, could be accomplished with:
#+begin_src R :session analysis :var dir=(directory-file-name default-directory)
setwd(dir)
#+end_src
Amazing that we can make two different languages talk like this! I guess I didn't even need to generate intermediate CSV?!
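Indeed, org-babel can pass tabular data straight into an R block as a data.frame through the same :var mechanism, so the CSV detour is optional. A hedged sketch, reusing the calculate-words-by-month function defined earlier (the block name wc-2016 and the colnames call are invented for illustration):

```org
#+name: wc-2016
#+begin_src elisp
(cl-loop for (month . wc) in (calculate-words-by-month 2016)
         collect (list month wc))
#+end_src

#+begin_src R :session analysis :var wc=wc-2016
## The list of lists arrives as a data.frame; give its columns names.
colnames(wc) <- c("month", "wc")
head(wc)
#+end_src
```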
And now, let's define a function that gives us the relative path to the static image folder. We don't even need to work this out mentally. We can start from a well-defined common ancestor: the project root (a little late, but Emacs nowadays has the built-in project.el for all things project related, so no need to reach for third-party packages like projectile, however awesome it still is). From there we can get the static asset folder for images (creating it if it doesn't exist). And finally, we can ask Emacs to calculate the relative path from this post's directory to the image folder!
(defun get-image-asset-dir ()
  (let* ((root (project-root (project-current)))
         (image-dir (file-name-concat root "static/images/"))
         (post-dir default-directory))
    (mkdir image-dir t)
    (file-relative-name image-dir post-dir)))

(get-image-asset-dir)
Which gives us:
That's what we want! Now we will ask ESS to save the image file in that directory by passing :output-dir (get-image-asset-dir) to the block, but that's also something you could make global by defining it at the top of the file, like before.
wc_by_year <- tibble(
  year = fct_rev(factor(years, levels = years)),
  wc = input %>% select_at(years) %>% colSums()
)

ggplot(wc_by_year, aes(year, wc)) +
  geom_col() +
  coord_flip() +
  labs(x = "", y = "", title = "Word Count by Year") +
  theme_ipsum_rc(grid = "X")
Anyway, this confirms what I already knew. The volume was only really good for the first year or so. Now it has shrunk to roughly 25% of 2016 :(
The UNIX hacker is smirking, thinking how we have only now achieved parity with what wc could do in one line. But twist the question a little, say, word count grouped by month instead of year, and wc has no answer, while we have the foundation to quickly leave it in the dust.
wc_by_month <- tibble(
  month = fct_rev(factor(months, levels = months)),
  wc = input %>% select_at(years) %>% rowSums()
)

ggplot(wc_by_month, aes(x = month, y = wc)) +
  geom_col() +
  scale_y_comma(limits = c(0, 42000)) +
  coord_flip() +
  labs(x = "", y = "", title = "Word Count by Month") +
  theme_ipsum_rc(grid = "X")
Hmm, that's something I didn't know. Apparently, for some reason, I really don't like to write in November/December? I think we should normalise the numbers within each year though, because the circumstances probably differed a lot from year to year. By the way, R has anonymous functions now!
wc_by_month_normalised <- tibble(
  month = fct_rev(factor(months, levels = months)),
  wc = input %>%
    select_at(years) %>%
    mutate_all(function(x) x / sum(x)) %>%
    rowMeans()
)

ggplot(wc_by_month_normalised, aes(x = month, y = wc)) +
  geom_col() +
  coord_flip() +
  labs(x = "", y = "", title = "Word Count by Month (normalised)") +
  theme_ipsum_rc(grid = "X")
It's pretty much the same, apart from some subtle changes. Now August pulls ahead of January and April, by virtue of being more consistent relative to the other months of the same year. That tracks, I guess: I probably spend more time indoors in August, because I live in the tropics and that's when the monsoon hits, and I really hate the mud and clogged waters everywhere.
Volume in terms of word count is one thing, but I also want to know how consistently I showed up to write over time (without being late or missing days altogether). As I hinted, there is a heading some levels deep that is the date, and the content therein is basically the entry.
The date is easily parseable, as it stands alone in the heading in dd.mm.yyyy format (yes, not very ISO 8601 of me).
Back to the drawing board, I mean the *scratch* buffer, for some more elisp.
Before we can figure out missing days, first we need a list of all days in a year. This works:
(defun diary-all-days-in-year (year)
  (let* ((day 1)
         (time (encode-time `(1 0 0 ,day 1 ,year nil nil nil)))
         (all nil))
    (while (= (nth 5 (decode-time time)) year)
      (setq all (cons (format-time-string "%d.%m.%Y" time) all))
      (setq day (+ 1 day)
            time (encode-time `(0 0 0 ,day 1 ,year nil nil nil))))
    (nreverse all)))

(let ((days (diary-all-days-in-year 2020)))
  (format "Total days in 2020: %d\n" (length days)))
Total days in 2020: 366
Now we want to calculate how many days have corresponding entry in the diary:
(defun diary-all-entries (year)
  (org-ql-select (list (concat "~/.notes/" (int-to-string year) ".org"))
    `(and (heading-regexp ,(int-to-string year))
          (level 4))
    :action (lambda ()
              (org-element-property :raw-value (org-element-at-point)))))

(let ((days (diary-all-entries 2020)))
  (format "Total days with an entry in 2020: %d\n" (length days)))
Total days with an entry in 2020: 355
The difference between these two sets is basically the missing days. Let's generalise that across all years:
(nconc (list '("Year" "Misses") 'hline)
       (cl-loop for year from 2016 upto 2021
                collect (list year
                              (length (seq-difference
                                       (diary-all-days-in-year year)
                                       (diary-all-entries year))))))
Although this works great, it's still slightly misleading, because when I do miss a day, for the most part I try my best to go back and plug the gap later. Which is great, but we also want to see how often we are late.
That would be impossible to determine, if it weren't for the fact that I have a helper function that generates the boilerplate before I write, which takes care of creating the headline and, most importantly, the current timestamp.
A timestamp in org-mode looks like [2022-01-01 Sat 16:47]. We can look for it with a regex, and then parse it (for which I will use the awesome ts.el library).
(require 'ts)

(defun diary-late-entries (year)
  (let ((ts-pattern (rx line-start
                        (any "<" "[")
                        "20" (+? nonl)
                        (any ">" "]")
                        line-end)))
    (flatten-tree
     (org-ql-select (list (concat "~/.notes/" (int-to-string year) ".org"))
       `(and (heading-regexp ,(int-to-string year))
             (level 4))
       :action
       (lambda ()
         (let* ((element (org-element-at-point))
                (date (org-element-property :raw-value element))
                (beg (org-element-property :contents-begin element))
                (end (org-element-property :contents-end element))
                (timestamp)
                (timedate))
           (goto-char beg)
           (when (search-forward-regexp ts-pattern end t 1)
             (setq timestamp (ts-unix (ts-parse-org (match-string-no-properties 0)))
                   timedate (format-time-string "%d.%m.%Y" timestamp))
             (unless (string= date timedate)
               ;; Tolerate entries begun a couple of hours past midnight
               ;; by re-rendering the timestamp with a -2h zone offset.
               (setq timedate (format-time-string "%d.%m.%Y" timestamp (* -2 3600)))
               (unless (string= date timedate)
                 date)))))))))

(let ((days (diary-late-entries 2020)))
  (format "Total late days in 2020: %d\n" (length days)))
It got a little complicated because I had to account for my nocturnal nature (sometimes I begin writing well past midnight, but I don't want to count that as "late").
However the result is still kind of horrifying.
Total late days in 2020: 124
We can now generate report for all years:
(nconc (list '("Year" "Late") 'hline)
       (cl-loop for year from 2016 upto 2021
                collect (list year (length (diary-late-entries year)))))
Well, isn't that a grim reading.
Well, I already knew in my bones that I am doing worse at writing, so that's no news, and it's not what I was trying to highlight.
But it was a good medium for showcasing the capabilities of org-mode, because I didn't set out to keep journals with the intention of later being able to do any of this. Org-mode is simply good at being a text-based database.
In my experience, the detractors of Elisp are mostly CL users. They also love to rain on the parade of pretty much every other Lisp, and they are not above gatekeeping about what makes a Lisp and what doesn't. If you look at GitHub's stats for 2021 Q4: by commit push count, Elisp is at #29, whereas CL is nowhere to be found in the top 50. By new PR count, Elisp is at #31, whereas CL is again nowhere to be found. By new issue count, Elisp is at #22, and you can finally get a whiff of CL at #46. Elisp is, along with Clojure, the most popular Lisp in the world, which probably explains the salt.