Archive Reddit User History

July 20, 2019

Categories: Misc Tags: Data

Actually, this is more of a post about the Pushshift API. If you are into computational sociology, or just want to use Reddit better (the default search is pretty bad), then you must look at it. Recently, I wanted to archive the history of certain users.

First I went at it using the official API (for which PRAW is a nice wrapper). However, the user history it provides only goes back up to 1000 comments and submissions, which is not enough. The other problem may sound iffy, but I wanted deleted history too. No one on the internet should be naive enough to believe that the delete button actually does what it claims, more so on Reddit, with millions of bots subscribing to the live feed. Besides, many things on Reddit are no longer accessible even though their authors never explicitly took them down. Hence my need for archiving.

This is trivially possible using Pushshift. One slight hiccup is that the current version of the API returns results in chunks of at most 500 items, so you do need to do some bookkeeping of your own. Here is my bash script (with only curl and jq as dependencies):

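#!/usr/bin/env bash

# NOTE: these four assignments are missing from the listing; the values are assumptions.
username=$1
tmp="${TMPDIR:-/tmp}/reddit-archive"    # scratch directory for the raw pages
current=0                               # pages are saved as files named 0, 1, 2, ...
url='https://api.pushshift.io/reddit/search/comment/'    # assumed endpoint (comments)

fetch="curl -s -G '${url}' -d 'author=${username}' -d 'sort=asc' -d 'sort_type=created_utc' -d 'size=500'"

mkdir -p "$tmp" && cd "$tmp" || exit 1

eval "$fetch" > "$current"

while true; do
    last_time=$(jq '.data[-1].created_utc' "$current")
    # jq prints null once a page comes back empty
    if [[ -n $last_time && $last_time != 'null' ]]; then
        current=$((current + 1))
        eval "$fetch -d 'after=${last_time}'" > "$current"
    else
        break
    fi
done

rm "$current"    # the last page holds no items

# concatenate the items of every page, oldest first
for ((i = 0; i < current; i++)); do
    jq '.data[]' "$i"
done > "$username"

cd -

mv "${tmp}/${username}" .

rm -f "${tmp}"/*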

Just pass the username as the first argument, and the script will save the history in JSON format (which I have learned to love, since jq is such a great tool). For quickly skimming it in my pager (less) I do:

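#!/usr/bin/env sh
# unescape.py is a placeholder name; the script's real name is not given here.
jq '.body' "$1" | python3 unescape.py | less

Where the Python script looks like: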

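import html
import json
import sys

delim = "\n{}\n".format("-" * 80)

# The loop body is missing from the listing; what follows is a guess at it.
# jq prints each body as a one-line JSON string, so decode it and
# unescape the HTML entities before printing.
for line in sys.stdin:
    body = json.loads(line)
    if body is not None:    # items without a body come through as null
        print(html.unescape(body), end=delim)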
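
The archive itself is a stream of concatenated JSON objects rather than a single array (that is what jq '.data[]' emits), so reading it without jq takes a little care. Below is a minimal sketch of one way to do it in Python; the function names are made up for illustration, and it assumes comment bodies are stored with HTML entities escaped.

```python
import html
import json


def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def comment_bodies(text):
    """Yield the unescaped body of every item that has one."""
    for item in iter_json_objects(text):
        if "body" in item:
            yield html.unescape(item["body"])
```

Something like `comment_bodies(open("some_username").read())` would then walk the whole archive, where `some_username` stands for whatever file the bash script produced.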

Did I just broadcast to the world that I blindly use eval? *gulps*
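
For the eval-averse, the same bookkeeping can be sketched in Python, where no shell quoting is involved. Treat this as a sketch under assumptions rather than a drop-in replacement: the endpoint URL is assumed to be the Pushshift comment search endpoint, and `fetch_all` and `http_fetch` are names invented here. The page fetcher is passed in as a parameter so the loop can be tried without touching the network.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Pushshift comment search endpoint.
API_URL = "https://api.pushshift.io/reddit/search/comment/"


def http_fetch(params):
    """Fetch one page and return its 'data' list."""
    url = API_URL + "?" + urllib.parse.urlencode(params)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]


def fetch_all(username, fetch_page=http_fetch, size=500):
    """Collect every item for `username`, oldest first, page by page."""
    params = {"author": username, "sort": "asc",
              "sort_type": "created_utc", "size": size}
    items = []
    while True:
        page = fetch_page(params)
        if not page:
            break  # an empty page means there is nothing newer
        items.extend(page)
        # Resume right after the newest item seen so far.
        params = dict(params, after=page[-1]["created_utc"])
    return items
```

Like the bash version, this pages on `created_utc`, so items sharing a timestamp with the last item of a page may be skipped.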

I tried the new API (still in beta) and it's faster (Elasticsearch/Lucene optimisation) and might not have the 500-item chunk limit once it comes out. Right now there is even a way to get real-time data using an SSE stream.
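
For a feel of what consuming such a stream involves, here is a simplified sketch of parsing server-sent events in Python. It covers only the generic SSE wire format (`event:` and `data:` fields, blank line as event terminator, `:` comment lines) and assumes nothing about Pushshift's actual stream URL, event names or payloads; `parse_sse` is a name invented here.

```python
def parse_sse(lines):
    """Yield (event, data) tuples from an iterable of decoded SSE lines."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment or keep-alive line
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)
            # other fields (id, retry) are ignored in this sketch
```

Any iterable of text lines will do as input, for example the decoded lines of a long-lived HTTP response.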