Since, like most reasonable people, I read a lot of fantasy books, I figured that applying NLP methods to the content of those books would help improve my intuitive understanding of the techniques involved. For this purpose, I needed as many fantasy books as possible in a format that could easily be processed in Python or Lisp.

I started by scraping a list of fantasy authors from Wikipedia.

import requests
import lxml.html.soupparser as sp
from urllib.parse import urljoin

url = 'https://en.wikipedia.org/wiki/List_of_fantasy_authors'
r = requests.get(url)
d = sp.fromstring(r.text)

# List of links to exclude from the results
exclude = ['Fantasy', 'List of fantasy novels', 'List of high fantasy fiction',
           'List of horror fiction authors', 'List of science fiction authors',
           'Lists of authors', 'website', 'Science Fiction and Fantasy Writers of America']

authors = [dict(name=x.text, url=urljoin(url, x.get('href')))
           for x in d.xpath('//div[@class="mw-parser-output"]/ul/li/a')
           if x.text not in exclude]
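
A quick sanity check (nothing fancy, just printing what the list comprehension produced) helps confirm that the XPath actually matched author entries:

# How many authors did we end up with, and what does an entry look like?
print(len(authors))
print(authors[0])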

I proceeded to search for ebooks by each of these authors on The Pirate Bay. Luckily, The Pirate Bay has a nice and easy-to-use API.

import subprocess
from urllib.parse import urlencode

def search_tpb(query):
    params = dict(q=query, cat=601)  # Category 601 stands for ebooks
    return requests.get(f'https://apibay.org/q.php?{urlencode(params)}').json()
    
def get_torrent(info_hash):
    link = 'magnet:?xt=urn:btih:' + info_hash
    subprocess.call(['transmission-remote', '-a', link])

for author in authors:
    # Only consider torrents that have seeders and are below 20MB in size
    results = [x for x in search_tpb(author['name'])
               if int(x['seeders']) > 0 and int(x['size']) / 1048576 < 20]
    if results:
        # Only consider first result
        get_torrent(results[0]['info_hash'])

True ebooks, i.e., books consisting mainly of text, were rarely larger than 2-3MB, while collections of books could reach 10-20MB. Most torrents larger than that either contained a lot of images or had been sorted into the ebook category by accident and were actually audio books or something else. Thus, limiting the file size to 20MB made sense.

Since the first result when searching for an author was often a collection of books by that author, I decided to download only the first result.

To download the torrents, I enabled remote access in my torrent client transmission-gtk, installed transmission-cli (apt install transmission-cli), and then simply used subprocess to add each magnet link and start the download.
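
Since transmission-remote is already set up at this point, the same client can also be used to keep an eye on the queue; a minimal sketch (-l simply lists all torrents together with their completion status):

import subprocess

# List all torrents known to the transmission daemon, including percent done
print(subprocess.check_output(['transmission-remote', '-l'], text=True))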

The next morning, I found that most of the downloads were complete (282 in total). Some didn't complete, probably because the seeders had disappeared.

Upon inspecting the downloaded data, I found that the books came in various formats (some more exotic than others) and often more than one format was provided for a given book.

A quick overview of all the file types:

>>> import os
>>> set(os.path.splitext(x)[1] for _, _, files in os.walk('data/download') for x in files)
{'', '.msi', '.prc', '.epub', '.lit', '.png', '.mp3', '.lrf', '.jpg', '.mobi', '.html',
'.doc', '.ini', '.rtf', '.opf', '.part', '.css', '.mbp', '.txt', '.ppt', '.azw3',
'.nfo', '.torrent', '.pdf', '.rar', '.RAR', '.zip', '.htm', '.gif', '.TXT'}

To simplify processing, I deleted all files that had no file extension and then renamed all files with uppercase extensions to their lowercase equivalents. The deletion is a simple find one-liner; for the renaming, I used the rename tool (apt install rename).

cd data/download
find . -type f  ! -name "*.?*" -delete  # Double check you're in the correct directory first!!
find . -type f -print0 | xargs -0 -I {} rename 's/\.([^.]+)$/.\L$1/gi' "{}"

After that, since calibre (the tool I was going to use for conversion) couldn't handle old-style doc files, I converted the doc files to html with LibreOffice's CLI.

cd data/download
# Note: without --outdir, soffice writes the converted html files into the current working directory
find . -name '*.doc' -exec soffice --headless --convert-to html {} \;

Then, I converted all books that were not already in the epub format to the epub format using calibre’s ebook-convert command (apt install calibre).

import os
import shutil
import subprocess

output_dir = 'data/epub'
os.makedirs(output_dir, exist_ok=True)  # Make sure the output directory exists

# Walk entire download folder
for dirpath, dirs, files in os.walk('data/download'):
    # Group files by filename root (filename without extension)
    groups = dict()
    for filename in files:
        root, ext = os.path.splitext(filename)
        groups.setdefault(root, []).append((root, ext))

    # Iterate over each group
    for key, values in groups.items():
        # If there is an epub file in the group, copy it and be done
        epubs = [x for x in values if x[1] == '.epub']
        if epubs:
            filename = key + epubs[0][1]
            shutil.copyfile(os.path.join(dirpath, filename),
                            os.path.join(output_dir, filename))

        # Otherwise check for other accepted file types and convert the first match to epub
        else:
            other_exts = ['.mobi', '.lrf', '.rtf', '.lit', '.prc',
                          '.rar', '.zip', '.pdf', '.html', '.htm', '.opf']
            # Here we iterate over possible extensions instead of the values because order
            # matters: other_exts is sorted by most to least desirable for conversion
            other_ebooks = [(key, ext) for ext in other_exts if (key, ext) in values]
            if other_ebooks:
                filename = key + other_ebooks[0][1]
                subprocess.check_call(['ebook-convert', os.path.join(dirpath, filename),
                                       os.path.join(output_dir, key + '.epub')])

At first I encountered an error from calibre: PyCapsule_GetPointer called with incorrect name. Replacing the calibre version that came with Ubuntu (apt purge calibre) with the newest version from https://calibre-ebook.com/download_linux solved that problem.

A few hours later, I had a whopping 3419 epub files ready to be processed. I deleted the downloaded files to save disk space.

I used Python's ebooklib package to read the epub files and dumped the metadata + content into json files.

import os, glob, json

import ebooklib
from ebooklib import epub
import lxml.html.soupparser as sp

def boom(l):
    """Join list of text and normalize whitespace."""
    return ' '.join(' '.join(l).split())

os.makedirs('data/json', exist_ok=True)  # Create the output directory for the json dumps

for filename in glob.glob('data/epub/*'):
    try:
        book = epub.read_epub(filename)
    except Exception:
        # Skip epub files that ebooklib cannot parse
        continue

    def get_dc(field):
        result = book.get_metadata('DC', field)
        return result[0][0] if result else None

    # Info from metadata is not always available or correct
    # but shall be kept as is for now
    title, creator = get_dc('title'), get_dc('creator')

    # Simply extract all text from the book and normalize whitespace
    text = boom([boom(sp.fromstring(x.get_body_content(), features='html.parser').xpath('//text()'))
                 for x in book.get_items()
                 if x.get_type() == ebooklib.ITEM_DOCUMENT])

    # Dump content + metadata to json file
    basename = os.path.basename(filename)
    out_path = os.path.join('data/json', os.path.splitext(basename)[0] + '.json')
    with open(out_path, 'w') as f:
        json.dump(dict(title=title, creator=creator, epub=basename, text=text), f)

After that I had a bunch of json files that I could easily use to explore NLP techniques.
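
As a first look at the corpus (a minimal sketch that only assumes the json layout produced above), one can tally the amount of text per author according to the metadata:

import glob, json
from collections import Counter

word_counts = Counter()
for filename in glob.glob('data/json/*.json'):
    with open(filename) as f:
        book = json.load(f)
    # Crude whitespace tokenization; good enough for a rough size estimate
    word_counts[book['creator']] += len(book['text'].split())

# Creators with the most text (creator may be None if the metadata was missing)
for creator, count in word_counts.most_common(10):
    print(creator, count)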

If you replicate this process and intend to read one of the downloaded books, I recommend that you buy the book or find another way to support the author.