As most reasonable people are familiar with the Harry Potter books, their content serves as ideal material for building a mnemonic system. The mnemonic major system, in particular, is used to memorize number sequences.

In order to implement the steps outlined in this post you need the content of the Harry Potter books (or other book(s) if you prefer). In a previous post I showed you how to download fantasy books and extract their text. Among the downloaded data were the Harry Potter books which I will use in this post.

Step 1: Learn the sound-number mapping

In the mnemonic major system each number from 0 to 9 is associated with one or more consonant sounds. Use the following table as a reference.

Number  Sounds (IPA)  Letters with example words
0       s, z          s (see), c (city), z (zero), x (xylophone)
1       t, d, ð, θ    t (tee), d (dad), th (though), th (think)
2       n, ŋ          n (nail)
3       m             m (monster)
4       r             r (right), l (colonel)
5       l             l (left)
6       ʤ, ʧ, ʃ, ʒ    ch (cheese), j (juice), g (ginger), sh (shell), c (cello, special),
                      cz (czech), s (tissue, vision), sc (fascist), sch (eschew),
                      t (ration), tsch (putsch), z (seizure)
7       k, ɡ          k (kid), c (cake), q (quarter), g (good), ch (loch)
8       f, v          f (face), ph (phone), v (alive), gh (laugh)
9       p, b          p (power), b (baby)

In English, the same letter can be pronounced in different ways depending on the context, which is why some letters appear in several rows. In the end, only the sound matters, not the spelling.

Here are a few examples of words, their IPA representation and the number they encode.

Word     IPA      Number
action   ækʃən    762
muddy    mədi     31
midday   mɪddeɪ   311
accept   æksɛpt   7091
fax      fæks     870
exam     ɪgzæm    703
anxious  æŋkʃəs   2760
luxury   ləgʒəri  5764
pizza    pitsə    910
ghost    goʊst    701
enough   inəf     28
fear     fɪr      84

To familiarize yourself with the IPA notation, try to read the following excerpt.

Sorting Hat: həm, dɪfəkəlt. vɛri dɪfəkəlt. plɛnti əv kərɪʤ, aɪ si.
             nɑt ə bæd maɪnd, iðər. ðɛrz tælənt, oʊ jɛs. ənd ə θərst tɪ pruv jʊrsɛlf.
             bət wɛr tɪ pʊt ju?
Harry:       nɑt slɪðərɪn. nɑt slɪðərɪn.
Sorting Hat: nɑt slɪðərɪn, ɛ? ər ju ʃʊr? ju kʊd bi greɪt, ju noʊ. ɪts ɔl hir ɪn jʊr hɛd.
             ənd slɪðərɪn wɪl hɛlp ju ɔn ðə weɪ tɪ greɪtnəs, ðɛrz noʊ daʊt əbaʊt ðət. noʊ?
Harry:       pliz, pliz. ɛniθɪŋ bət slɪðərɪn, ɛniθɪŋ bət slɪðərɪn.
Sorting Hat: wɛl ɪf jʊr ʃʊr, bɛtər bi... grɪfɪndɔː!

Consider the same lines converted to number sequences.

Sorting Hat: 3 18751 84 18751 9521 8 746 0 21 91 321 14 140 1521 0 21 1401 1 948 4058 91 4 1 91
Harry:       21 05142 21 05142
Sorting Hat: 21 05142 4 64 71 9 741 2 10 5 4 2 4 1 21 05142 5 59 2 1 1 74120 140 2 11 91 11 2
Harry:       950 950 212 91 05142 212 91 05142
Sorting Hat: 5 8 4 64 914 9 74821

I’m going to show you how to write a training program to internalize the concept of mapping sounds to numbers in Step 4. But first, you need the ability to convert text automatically to numbers. This happens in 2 steps. First, the text is converted to IPA. Then, the IPA is converted to numbers.

Step 2: Converting IPA to numbers

The process of converting IPA to numbers is very simple. I iterate through the IPA chars and if there is a number associated with the char I append the number to the result.

# Mapping from number to sounds
num_to_phones = {0: ['s', 'z'], 1: ['t', 'd', 'ð', 'θ'], 2: ['n', 'ŋ'], 3: ['m'], 4: ['r'],
                 5: ['l'], 6: ['ʤ', 'ʧ', 'ʃ', 'ʒ'], 7: ['k', 'g'], 8: ['f', 'v'],
                 9: ['p', 'b']}

# Reverse mapping from sound to number
phone_to_num = {x: k for k, v in num_to_phones.items() for x in v}

def major_decode_from_ipa(ipa):
    """Convert IPA to number sequence."""
    result = []
    for char in ipa:
        if (num := phone_to_num.get(char)) is not None:
            result.append(num)
    return result

For example, major_decode_from_ipa('dɪfəkəlt') yields [1, 8, 7, 5, 1].

Additionally, I define a couple of functions for converting number sequences to and from strings.

def numseq_to_str(numseq):
    """Convert number sequence to string."""
    return ''.join(str(x) for x in numseq)

def str_to_numseq(s):
    """Convert string to number sequence."""
    return [int(x) for x in s if x.isdigit()]

For example, numseq_to_str([1, 8, 7, 5, 1]) yields '18751'.
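
As a quick sanity check, here is a small sketch that exercises both helpers together. It repeats their definitions so that it runs on its own. Note that str_to_numseq skips anything that is not a digit, so spaces used for grouping are simply ignored.

```python
def numseq_to_str(numseq):
    """Convert number sequence to string."""
    return ''.join(str(x) for x in numseq)

def str_to_numseq(s):
    """Convert string to number sequence, ignoring non-digit characters."""
    return [int(x) for x in s if x.isdigit()]

# Round trip: list -> string -> list
assert str_to_numseq(numseq_to_str([1, 8, 7, 5, 1])) == [1, 8, 7, 5, 1]

# Spaces between groups of digits are dropped
print(str_to_numseq('21 05142'))  # [2, 1, 0, 5, 1, 4, 2]
```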

Step 3: Converting text to IPA

In order to automatically convert text to IPA (and then to numbers) you need to use an IPA dictionary.

Python’s eng-to-ipa package is able to convert text to IPA using the Carnegie Mellon University (CMU) Pronouncing Dictionary.

import eng_to_ipa

s = '“I’ll bring up some sandwiches.”'  # Sentence from the HP books
ipa = eng_to_ipa.convert(s, retrieve_all=True, keep_punct=False, stress_marks=False)
print(ipa)

This yields:

['“i’ll* brɪŋ əp səm sandwiches.”*']

According to the docs, eng-to-ipa will reprint words that cannot be found in the CMU dictionary with an asterisk. Thus, “i’ll and sandwiches.” have not been found. Clearly, the punctuation is the issue. I preprocess the text in order to ease the conversion to IPA.

import sys, unicodedata

# Punctuation including unicode chars
punctuation = ''.join(chr(i) for i in range(sys.maxunicode)
                      if unicodedata.category(chr(i)).startswith('P'))

def preprocess(text):
    """Strip punctuation between words, normalize space, lowercase, replace unicode apostrophe."""
    return ' '.join(x.strip(punctuation).lower().replace('’', '\'')
                    for x in text.split())
                    
print(eng_to_ipa.convert(preprocess(s), retrieve_all=True, keep_punct=False, stress_marks=False))

And…

['aɪl brɪŋ əp səm sæmwɪʧɪz', 'aɪl brɪŋ əp səm sændwɪʧɪz', 'aɪl brɪŋ əp səm sænwɪʧɪz']

As you can see, there are 3 ways to pronounce this sentence depending on whether you like to pronounce sandwich with m, n or nd. In order to use the major system effectively, you should use the version that sounds most natural to you.
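
To see why the choice matters, here is a small sketch that decodes the three variants. The mapping is repeated from Step 2 so that the snippet runs on its own; the decode function is a compact rewrite of major_decode_from_ipa and is assumed to behave the same way.

```python
# Sound-to-digit mapping, repeated from Step 2
num_to_phones = {0: ['s', 'z'], 1: ['t', 'd', 'ð', 'θ'], 2: ['n', 'ŋ'], 3: ['m'], 4: ['r'],
                 5: ['l'], 6: ['ʤ', 'ʧ', 'ʃ', 'ʒ'], 7: ['k', 'g'], 8: ['f', 'v'],
                 9: ['p', 'b']}
phone_to_num = {x: k for k, v in num_to_phones.items() for x in v}

def major_decode_from_ipa(ipa):
    """Keep only the characters that map to a digit."""
    return [phone_to_num[c] for c in ipa if c in phone_to_num]

variants = ['sæmwɪʧɪz', 'sændwɪʧɪz', 'sænwɪʧɪz']
decoded = {v: ''.join(str(d) for d in major_decode_from_ipa(v)) for v in variants}
for ipa, digits in decoded.items():
    print(ipa, '->', digits)
# sæmwɪʧɪz -> 0360
# sændwɪʧɪz -> 02160
# sænwɪʧɪz -> 0260
```

Three pronunciations, three different numbers. If you memorize a number with one pronunciation in mind and recall it with another, you get the wrong digits.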

Let’s look at another example.

print(eng_to_ipa.convert(preprocess('Well, if you’re sure — better be GRYFFINDOR!'),
                         retrieve_all=True, keep_punct=False, stress_marks=False))
['wɛl ɪf jur ʃʊr bɛtər bi gryffindor*', 'wɛl ɪf jʊr ʃʊr bɛtər bi gryffindor*']

The word gryffindor was not found in the CMU dictionary, which is to be expected. After a quick search for the word’s pronunciation I found YouGlish, which uses YouTube videos to find IPAs. While their API is not free, a limited number of IPAs can be scraped for our purpose.

import time, random, requests, lxml.html.soupparser

def ipa_from_youglish(word):
    """Scrape IPA for word from youglish.com."""
    url = f'https://youglish.com/pronounce/{word}/english?'
    while True:
        print(f'Scraping word "{word}" from youglish...', end='', flush=True)
        response = requests.get(url)
        if 'Usage limit exceeded' in response.text:
            raise Exception('YouGlish usage limit exceeded')
        root = lxml.html.soupparser.fromstring(response.text)
        if root.xpath('//div[@class="g-recaptcha"]'):
            print('RECAPTCHA')
            input(f'Open {url}, submit CAPTCHA challenge, press enter to continue.')
        else:
            break
    time.sleep(random.random() * 3)
    d = root.xpath('//div[@id="phoneticPanel"]/div/ul[@class="transcript"]'
                  '/li/span[contains(text(), "Traditional IPA")]'
                  '/following-sibling::text()')
    if d:
        print('SUCCESS')
        return d[0].strip(' ˈ')
    print('FAILED')

As you can see, this function is semi-interactive. Without user intervention it will get stuck on a CAPTCHA. Even then, you’ll eventually reach their daily usage limit and won’t be able to continue. For our purpose this shall be good enough though.

> ipa_from_youglish('gryffindor')
Scraping word "gryffindor" from youglish...SUCCESS
grɪfɪndɔː

I have shown that for many words there are several possible pronunciations, from which you need to choose your preferred one, and that some words are not in the CMU dictionary and require scraping the IPA from another source or are not available at all. For these 2 reasons, you will need to build your own personal IPA dictionary.

I’m going to build my IPA dictionary by iterating through the words of the Harry Potter books and adding each word and the corresponding IPA to my dictionary.

First, I define some functions for managing my dictionary (a simple JSON file in this case).

import os, glob, json

ipa_dict_path = 'data/ipa-dict.json'

def load_json_file_or_dict(filename):
    """Load data from json file if exists otherwise return empty dict."""
    if os.path.isfile(filename):
        with open(filename) as f:
            return json.load(f)
    return dict()

def save_to_json_file(data, filename):
    """Save data to json file."""
    with open(filename, 'w') as f:
        json.dump(data, f)

def load_ipa_dict():
    """Load IPA dict from json file."""
    return load_json_file_or_dict(ipa_dict_path)

def save_ipa_dict(ipa_dict):
    """Save IPA dict to json file."""
    save_to_json_file(ipa_dict, ipa_dict_path)

Next, I iterate through the words of the books and enter each word and whatever is returned by eng-to-ipa into my dictionary.

def harry_potter_text():
    """Return entire content of the Harry Potter books in a single string."""
    data = []
    for filename in glob.glob('data/json/Harry Potter*'):
        with open(filename) as f:
            data.append(json.load(f)['text'])
    return ' '.join(data)
    
def populate_ipa_dict_from_text(text):
    """Get all IPA information from eng_to_ipa and save to ipa_dict."""
    ipa_dict = load_ipa_dict()
    words = preprocess(text).split()
    for word in set(words) - set(ipa_dict.keys()):
        ipa = eng_to_ipa.convert(word, retrieve_all=True, keep_punct=False,
                                 stress_marks=False)
        ipa_dict[word] = ipa
    save_ipa_dict(ipa_dict)
    
populate_ipa_dict_from_text(harry_potter_text())

Each value in the dictionary is now a list of possible IPAs as that is what eng-to-ipa returned. Before the dictionary can be used, we need to ensure that each word has exactly one IPA. If the different IPAs for a word decode to different numbers we need to ask the user for their preferred pronunciation. Otherwise, the choice is of no consequence and we simply choose the first one.

def get_ipa_pref(word, ipa):
    """Let user choose preferred IPA for word."""
    print()
    print(f'Choose IPA for word "{word}":')
    for i, w in enumerate(ipa):
        print(f'{i}: {w}')
    choice = int(input('Your choice: '))
    print()
    return ipa[choice]

def disambiguate_ipas_in_ipa_dict():
    """Disambiguate IPAs for each word by choosing one."""
    ipa_dict = load_ipa_dict()
    try:
        for word, ipa in ipa_dict.items():
            if isinstance(ipa, list):
                if len(ipa) == 1 \
                   or len(set(tuple(major_decode_from_ipa(x)) for x in ipa)) == 1:
                    ipa_dict[word] = ipa[0]
                else:
                    ipa_dict[word] = get_ipa_pref(word, ipa)
    except BaseException:  # includes Ctrl-C: save the choices made so far, then re-raise
        save_ipa_dict(ipa_dict)
        raise
    save_ipa_dict(ipa_dict)
    
disambiguate_ipas_in_ipa_dict()

Example:

Choose IPA for word "sandwich":
0: sæmwɪʧ
1: sændwɪʧ
2: sænwɪʧ
Your choice: 1

Next, I search the IPA dict for values followed by an asterisk (which as you’ve seen earlier, indicates that no IPA was found in the CMU dictionary) and attempt to scrape it from YouGlish.

import re, collections

def find_missing_ipas(text):
    """Attempt to source missing IPAs or delete from ipa_dict, frequent words first."""
    ct = collections.Counter(preprocess(text).split())
    ipa_dict = load_ipa_dict()
    for word, count in ct.most_common():
        if word not in ipa_dict:
            continue
        ipa = ipa_dict[word]
        if m := re.match(r'(\S+)\*', ipa):
            print(ct[word], end='\t')
            ipa = ipa_from_youglish(m.group(1))
            if ipa:
                ipa_dict[word] = ipa
            else:
                del ipa_dict[word]
            save_ipa_dict(ipa_dict)
            
find_missing_ipas(harry_potter_text())

And that’s all, my IPA dictionary is ready! I define a few functions for converting text to IPA conveniently.

class NoIPAFound(Exception):
    """Failed to find IPA representation in ipa_dict."""

def word_to_ipa(ipa_dict, word):
    """Return IPA for word."""
    try:
        return ipa_dict[word]
    except KeyError:
        raise NoIPAFound(word)

def text_to_ipa(ipa_dict, text):
    """Return all IPA for text."""
    words = preprocess(text).split()
    return ' '.join(word_to_ipa(ipa_dict, x) for x in words)

def has_ipa(ipa_dict, text):
    """Test if there is IPA for text."""
    try:
        text_to_ipa(ipa_dict, text)
    except NoIPAFound:
        return False
    return True

For example, text_to_ipa(load_ipa_dict(), 'Well done, Harry!') yields wɛl dən hɛri.

Using the major_decode_from_ipa function from the previous step I can now convert text to numbers.

> major_decode_from_ipa(text_to_ipa(load_ipa_dict(), 'Well done, Harry!'))
[5, 1, 2, 4]

To make it even more convenient, I define a function major_decode_from_text.

def major_decode_from_text(ipa_dict, text, group_by_words=False):
    """Decode text and return number sequence."""
    return [major_decode_from_ipa(text_to_ipa(ipa_dict, x)) for x in text.split()] \
            if group_by_words else major_decode_from_ipa(text_to_ipa(ipa_dict, text))

This function has the added benefit that it can group the result by the words in the source sentence which is useful when you’re printing the number sequence for a longer text.

> major_decode_from_text(load_ipa_dict(), 'Well done, Harry!', group_by_words=True)
[[5], [1, 2], [4]]
> ' '.join(numseq_to_str(x) for x in _)
5 12 4

Step 4: Practice decoding words

In order to use the major system effectively, you need to be able to quickly convert text to numbers in your mind.

Now that you can convert text to numbers using the computer, it’s easy to write a simple training program for practicing doing the same thing in your mind.

import crayons

def train_decoding():
    """Interactively train decoding of words."""
    ipa_dict_items = list(load_ipa_dict().items())
    while True:
        word, ipa = random.choice(ipa_dict_items)
        numseq = major_decode_from_ipa(ipa)
        while True:
            print('Word is', crayons.blue(word, bold=True))
            user_input = input('Enter number: ')
            user_numseq = str_to_numseq(user_input)
            if numseq == user_numseq:
                print(f'{word} -> {ipa} -> {numseq}')
                print(crayons.green('Correct!\n'))
                break
            else:
                print(crayons.red('Try again!\n'))
                
train_decoding()

Example:

Word is wanting
Enter number: 2127
Try again!

Word is wanting
Enter number: 212
wanting -> wɑntɪŋ -> [2, 1, 2]
Correct!

Word is essence
Enter number: 020
essence -> ɛsəns -> [0, 2, 0]
Correct!

Word is tightness
Enter number: 

Continue practicing until the text-to-number conversion becomes second nature to you.

Step 5: Practice encoding numbers

When you want to memorize a number sequence using the major system, you need to find an appropriate encoding for it. The following code helps you practice this concept.

def train_encoding(min_numseq_len=1, max_numseq_len=4):
    """Interactively train encoding of numbers."""
    ipa_dict = load_ipa_dict()
    while True:
        numseq_len = random.randint(min_numseq_len, max_numseq_len)
        numseq = [random.randint(0, 9) for _ in range(numseq_len)]
        numseq_str = numseq_to_str(numseq)
        while True:
            print('Number is', crayons.magenta(numseq_str, bold=True))
            user_input = input('Enter text: ')
            try:
                user_ipa = text_to_ipa(ipa_dict, user_input.strip())
            except NoIPAFound as e:
                print(crayons.red(f'No IPA found for "{e.args[0]}". Try again!\n'))
                continue
            user_numseq = major_decode_from_ipa(user_ipa)
            print(f'{user_input} -> {user_ipa} -> {user_numseq}')
            if user_numseq == numseq:
                print(crayons.green('Correct!\n'))
                break
            else:
                print(crayons.red('Try again!\n'))

Example:

Number is 61
Enter text: shit
No IPA found for "shit". Try again!

Number is 61
Enter text: jet
jet -> ʤɛt -> [6, 1]
Correct!

Number is 1982
Enter text: tube fan
tube fan -> tjub fæn -> [1, 9, 8, 2]
Correct!

Finding good (memorable) encodings for a given number sequence is a much more creative (and laborious) process than decoding text. Let’s see if we can use the content of the Harry Potter books to encode number sequences.

Step 6: Find encodings automatically

Given a number sequence, I’m going to search the Harry Potter books for suitable encodings. The processes involved can be slow, so it makes sense to precompute all possible encodings and save them to index files.

Here are a few convenience functions that allow me to create index files from lists of strings and query them for number sequences.

def load_numseq_index(filename):
    """Load (number sequence -> text) index."""
    index = load_json_file_or_dict(filename)
    return {k: set(v) for k, v in index.items()}

def save_numseq_index(index, filename):
    """Save (number sequence -> text) index."""
    index = {k: list(v) for k, v in index.items()}
    save_to_json_file(index, filename)

def combine_numseq_indexes(indexes):
    """Combine multiple (number sequence -> text) indexes into one."""
    d = dict()
    for index in indexes:
        for k, v in index.items():
            for x in v:
                d.setdefault(k, set()).add(x)
    return d

def build_numseq_index_from_strings(filename, ipa_dict, strings, extend_index=True):
    """Create an index for the mapping from number sequences to text and save it."""
    index = load_numseq_index(filename) if extend_index else dict()
    for string in strings:
        try:
            numseq = major_decode_from_text(ipa_dict, string)
        except NoIPAFound:
            continue
        numseq_str = numseq_to_str(numseq)
        index.setdefault(numseq_str,  set()).add(string)
    save_numseq_index(index, filename)
    
def find_encodings_for_numseq_with_index(numseq, index):
    """Check index for numseq and return encoding if found."""
    numseq_str = numseq_to_str(numseq)
    return list(index.get(numseq_str, []))

def find_encodings_for_numseq_with_index_file(numseq, filename):
    """Load index file and pass to find_encodings_for_numseq_with_index."""
    index = load_numseq_index(filename)
    return find_encodings_for_numseq_with_index(numseq, index)
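
To make the shape of such an index concrete, here is a minimal in-memory sketch. The sample data is hypothetical and the digit strings are written by hand instead of being decoded through the IPA dictionary. It also shows why the load and save functions above convert between sets and lists: JSON has no set type.

```python
import json

# Hypothetical sample data: text chunks paired with their decoded digit strings
decoded_chunks = [
    ('slytherin', '05142'),
    ('slithering', '05142'),
    ('wand', '21'),
    ('not', '21'),
]

# Build the index: digit string -> set of texts that decode to it
index = {}
for text, digits in decoded_chunks:
    index.setdefault(digits, set()).add(text)

# Sets are not JSON-serializable, hence the conversion to lists before saving ...
serialized = json.dumps({k: sorted(v) for k, v in index.items()})

# ... and the conversion back to sets after loading
restored = {k: set(v) for k, v in json.loads(serialized).items()}

print(sorted(restored.get('05142', [])))  # ['slithering', 'slytherin']
print(sorted(restored.get('999', [])))    # []
```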

Next, I’m going to create several indexes for different types of text chunks extracted from the Harry Potter books.

Words

Well duh! All words that can be encoded are already in our IPA dictionary so finding words for number sequences is straightforward.

ipa_dict = load_ipa_dict()

# Simply use the keys of ipa_dict as they are the words
build_numseq_index_from_strings('data/numseq-word-index.json', ipa_dict,
                                ipa_dict.keys())

To use this index, I defined the following function.

def find_words_for_numseq(numseq):
    return find_encodings_for_numseq_with_index_file(numseq,
                                                     'data/numseq-word-index.json')

Example:

> find_words_for_numseq([0, 5, 1, 4, 2])
['slytherin', 'slaughtering', 'slithering', 'sweltering']

Nouns

Nouns are generally easier to imagine (and thus memorize) than other types of words. For this reason, it makes sense to look for nouns specifically when trying to find encodings for a number sequence.

I’m going to use Python’s NLTK package to process the Harry Potter text and identify nouns. The process required for this is called part-of-speech tagging (POS tagging).

Here is an example to give you an idea of what the POS tagger does.

> s = ('"It is the unknown we fear when we look upon death and darkness,'
       ' nothing more," said Dumbledore.')
> nltk.pos_tag(nltk.tokenize.word_tokenize(preprocess(s)))
[('it', 'PRP'), ('is', 'VBZ'), ('the', 'DT'), ('unknown', 'JJ'), ('we', 'PRP'), ('fear', 'VBP'),
 ('when', 'WRB'), ('we', 'PRP'), ('look', 'VBP'), ('upon', 'IN'), ('death', 'NN'), ('and', 'CC'),
 ('darkness', 'NN'), ('nothing', 'NN'), ('more', 'JJR'), ('said', 'VBD'), ('dumbledore', 'NN')]

Now, I’m not a linguist but unknown seems like a noun to me in this case, yet it is marked as an adjective (JJ). Anyway, NLTK’s POS tagger does a good job identifying nouns generally. You can see the tag for nouns starts with NN.

Note: If you remove the call to preprocess, the tag for Dumbledore will be NNP (proper noun), not NN. That is because preprocess lowercases the entire text and NLTK uses casing to determine the correct tag.

For a list of tags see Penn Treebank P.O.S. Tags or run nltk.help.upenn_tagset() in Python.

Thus, the following code will extract all nouns from a piece of text.

import nltk

def nouns_from_text(text):
    """Extract nouns from text."""
    tokens = nltk.tokenize.word_tokenize(preprocess(text))
    return set(word for word, pos in nltk.pos_tag(tokens)
               if pos[:2] == 'NN')

Then I can build my noun index like so.

build_numseq_index_from_strings('data/numseq-noun-index.json', load_ipa_dict(),
                                nouns_from_text(harry_potter_text()))

Noun phrases

Noun phrases are nouns with modifiers, e.g. ridiculous muggle protection act, restricted section, giant gryffindor hourglass, slow-acting venoms, mrs weasley, ….

Noun phrases are useful for my purpose because they are as easy to remember as individual nouns (if not easier due to being more concrete) and have the potential to encode longer number sequences.

The technique I’m going to use to find noun phrases is called Chunking. Chunking is the segmentation and labelling of multi-token sequences. In other words it takes tokens (e.g. a tokenized sentence) as input and produces non-overlapping subsets of those tokens.

The way a list of tokens is chunked is defined by a chunk grammar. Here’s the grammar I’m going to use to find noun phrases.

NBAR:
    {<JJ.*>*<NN.*>+}  # Adjectives and Nouns
NP:
    {<NBAR>}
    {<NBAR><IN><NBAR>}  # Above, connected with in/of/...

The first rule in this grammar says that an NBAR chunk should be formed whenever the chunker finds zero or more adjectives (of all types, that’s why it’s JJ.* not just JJ) followed by one or more nouns (of all types).

The second rule says that an NP chunk should be formed whenever the chunker finds an NBAR chunk and whenever it finds two NBAR chunks connected by a preposition.

Examples of phrases that would be caught by the second rule but not by the first are head protruding over ron, death eater in disguise, ministry of magic, sound of laughter, ray of purest sunlight, cup of strong tea.

The following is the code I use for extracting noun phrases from any text.

def gen_chunks_with_grammar(text, grammar, tags, loop=1, min_token_num=1):
    """Generate chunks of text using an NLTK grammar."""
    cp = nltk.RegexpParser(grammar, loop=loop)
    for sent in nltk.tokenize.sent_tokenize(text):
        tokens = nltk.tokenize.word_tokenize(preprocess(sent))
        tagged = nltk.pos_tag(tokens)
        for subtree in cp.parse(tagged):
            if isinstance(subtree, nltk.Tree) and subtree.label() in tags:
                leaves = subtree.leaves()
                if len(leaves) >= min_token_num:
                    yield ' '.join(x[0] for x in leaves)

def noun_phrases_from_text(text):
    """Extract noun phrases from text."""
    grammar = r"""
        NBAR:
            {<JJ.*>*<NN.*>+}  # Adjectives and Nouns
        NP:
            {<NBAR>}
            {<NBAR><IN><NBAR>}  # Above, connected with in/of/...
    """
    return gen_chunks_with_grammar(text, grammar, ['NP'], min_token_num=2)

The first function is just a helper function that preprocesses, tokenizes and POS tags the text, then feeds it to the chunker, iterates over the produced subsets and reassembles the chunks I’m interested in into text.

Simple nouns would also be considered noun phrases by the grammar I used, thus I only consider the subsets consisting of more than one token.

And I build the index for noun phrases.

build_numseq_index_from_strings('data/numseq-noun-phrase-index.json', load_ipa_dict(),
                                noun_phrases_from_text(harry_potter_text()))

Clauses

A clause is a group of words that contains a subject and a verb and functions as a member of a sentence.

I used the same technique as for noun phrases but with a different grammar. I took the grammar I used from the NLTK docs.

def clauses_from_text(text):
    """Extract clauses from text."""
    grammar = r"""
    NP: {<DT|JJ|NN.*>+}          # Chunk sequences of DT, JJ, NN
    PP: {<IN><NP>}               # Chunk prepositions followed by NP
    VP: {<VB.*><NP|PP|CLAUSE>+$} # Chunk verbs and their arguments
    CLAUSE: {<NP><VP>}           # Chunk NP, VP
    """
    return gen_chunks_with_grammar(text, grammar, ['CLAUSE'])

A few example clauses:

voldemort took no notice
great eyes fixed on harry in an expression of watery adoration
dumbledore is dead
snape said nothing
professor lupin gave a snort
the effect was immediate
i had no idea

And I build the index.

build_numseq_index_from_strings('data/numseq-clause-index.json', load_ipa_dict(),
                                clauses_from_text(harry_potter_text()))

Sentences

Sentences are easily extracted by a simple call to nltk.tokenize.sent_tokenize.

def remove_double_quotation_marks(s):
    return s.translate(str.maketrans(dict.fromkeys(['\u201C', '\u201D', '\u0022'])))

sents = nltk.tokenize.sent_tokenize(remove_double_quotation_marks(harry_potter_text()))
build_numseq_index_from_strings('data/numseq-sentence-index.json', load_ipa_dict(), sents)

Double quotation marks need to be removed otherwise they affect sentence tokenization. Consider the following example.

“Are you sure? You could be great, you know, it’s all here in your head,
and Slytherin will help you on the way to greatness, no doubt about that — no?
Well, if you’re sure — better be GRYFFINDOR!”
Harry heard the hat shout the last word to the whole hall.

This would be sent-tokenized like so:

['“Are you sure?',
 'You could be great, you know, it’s all here in your head, and Slytherin will '
 'help you on the way to greatness, no doubt about that — no?',
 'Well, if you’re sure — better be GRYFFINDOR!” Harry heard the hat shout the '
 'last word to the whole hall.',
 'He took off the hat and walked shakily toward the Gryffindor table.']

The quotation mark following the exclamation mark in the third sentence prevents the sent tokenizer from breaking it up into two sentences.

With double quotation marks removed this problem is solved.

['Are you sure?',
 'You could be great, you know, it’s all here in your head, and Slytherin will '
 'help you on the way to greatness, no doubt about that — no?',
 'Well, if you’re sure — better be GRYFFINDOR!',
 'Harry heard the hat shout the last word to the whole hall.']
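
For reference, here is a small standalone sketch of what the quotation mark removal does on its own, without NLTK. The sample string is a shortened version of the example above, written with unicode escapes. A translation table that maps a character to None deletes that character, and the apostrophe (U+2019) is left alone.

```python
def remove_double_quotation_marks(s):
    # dict.fromkeys maps each character to None, which str.translate treats as "delete"
    table = str.maketrans(dict.fromkeys(['\u201C', '\u201D', '\u0022']))
    return s.translate(table)

sample = ('\u201CWell, if you\u2019re sure \u2014 better be GRYFFINDOR!\u201D '
          'Harry heard the hat shout the last word to the whole hall.')
cleaned = remove_double_quotation_marks(sample)

print(cleaned.startswith('Well,'))   # True
print('\u201D' in cleaned)           # False
print('\u2019' in cleaned)           # True
```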

Now that we have all the indexes we can define the remaining find functions.

def find_nouns_for_numseq(numseq):
    return find_encodings_for_numseq_with_index_file(numseq,
                                                     'data/numseq-noun-index.json')

def find_noun_phrases_for_numseq(numseq):
    return find_encodings_for_numseq_with_index_file(numseq,
                                                     'data/numseq-noun-phrase-index.json')

def find_clauses_for_numseq(numseq):
    return find_encodings_for_numseq_with_index_file(numseq,
                                                     'data/numseq-clause-index.json')

def find_sentences_for_numseq(numseq):
    return find_encodings_for_numseq_with_index_file(numseq,
                                                     'data/numseq-sentence-index.json')

# And a couple of functions that combine the indexes and find it all
def load_combined_numseq_index():
    files = glob.glob('data/numseq-*-index.json')
    return combine_numseq_indexes([load_numseq_index(x) for x in files])

def find_all_for_numseq(numseq):
    index = load_combined_numseq_index()
    return find_encodings_for_numseq_with_index(numseq, index)

Let’s do a quick integrity check and search for a number sequence mentioned earlier in this post.

> find_all_for_numseq(str_to_numseq('5 8 4 64 914 9 74821'))
['Well, if you’re sure — better be GRYFFINDOR!']

All good.

Measuring the coverage of the indexes

Try finding encodings for a random number sequence and you’ll quickly realize that there won’t be any results for many cases.

> find_all_for_numseq(str_to_numseq('9834 3029'))
[]

In order to determine how useful our indexes are we need to measure the probability of at least one encoding being found for a random number sequence of a certain length.

import itertools

def gen_all_numseqs_of_length(length):
    return [list(x) for x in itertools.product(range(10), repeat=length)]

def calc_coverage_for_numseq_len(length):
    """Return fraction of numseqs (of given length) that have at least one encoding."""
    index = load_combined_numseq_index()
    numseqs = gen_all_numseqs_of_length(length)
    encodings = [x for x in numseqs if find_encodings_for_numseq_with_index(x, index)]
    return len(encodings) / len(numseqs)

for length in range(1, 7):
    print(length, calc_coverage_for_numseq_len(length))

And the result is:

1 1.0
2 0.99
3 0.802
4 0.3603
5 0.07279
6 0.008393

Not very impressive! You are very unlikely to find a match even for something as simple as a phone number. To improve the numbers one might use more books (and thus content) and repeat the previous steps. One could also use more complicated NLP techniques to recombine text chunks into new phrases and sentences.

I will, however, show you a technique for encoding number sequences of any length that doesn’t rely purely on a high coverage of the indexes.
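
A side note on the coverage measurement: since every key in an index is a digit string, the number of covered sequences of length n is simply the number of keys of length n. The following sketch uses that to compute the same fractions without enumerating all 10^n sequences. The function name and the toy index are made up for illustration.

```python
import collections

def coverage_by_length(index, max_len):
    """Return {length: fraction of digit strings of that length present as keys}."""
    # Assumes every key is a plain digit string, as produced by numseq_to_str
    counts = collections.Counter(len(k) for k in index)
    return {n: counts[n] / 10 ** n for n in range(1, max_len + 1)}

# Hypothetical toy index: all ten single digits plus three two-digit keys
toy_index = {str(d): {'x'} for d in range(10)}
toy_index.update({'21': {'wand'}, '61': {'jet'}, '77': {'cake'}})

print(coverage_by_length(toy_index, 2))  # {1: 1.0, 2: 0.03}
```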

Step 7: Find noun sequences

In order to encode number sequences of an arbitrary length you need to combine several encodings and link them together. For example, to encode 2184775142 you could use the words wand frog cauldron. Then, create a story with those words to link them. For example, to memorize the word sequence wand frog cauldron you could imagine using a wand to conjure a frog inside a cauldron.
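
Here is a quick check of that example. The IPA transcriptions of the three words are assumptions written by hand, and the mapping is repeated from Step 2 so that the sketch runs on its own.

```python
# Sound-to-digit mapping, repeated from Step 2
num_to_phones = {0: ['s', 'z'], 1: ['t', 'd', 'ð', 'θ'], 2: ['n', 'ŋ'], 3: ['m'], 4: ['r'],
                 5: ['l'], 6: ['ʤ', 'ʧ', 'ʃ', 'ʒ'], 7: ['k', 'g'], 8: ['f', 'v'],
                 9: ['p', 'b']}
phone_to_num = {x: k for k, v in num_to_phones.items() for x in v}

def major_decode_from_ipa(ipa):
    """Keep only the characters that map to a digit."""
    return [phone_to_num[c] for c in ipa if c in phone_to_num]

# Assumed transcriptions for the three story words
story_words = {'wand': 'wɑnd', 'frog': 'frɑg', 'cauldron': 'kɔldrən'}

parts = [''.join(str(d) for d in major_decode_from_ipa(ipa))
         for ipa in story_words.values()]
print(parts)           # ['21', '847', '75142']
print(''.join(parts))  # 2184775142
```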

I’m using nouns since they are especially well suited for building stories.

Here’s a function to find noun sequences automatically from our indexes. The function is interactive, it asks you to choose a word from a list of possible nouns repeatedly until the entire number sequence is encoded.

def interactive_find_noun_sequences_for_numseq(numseq):
    """Interactively find noun sequences for a number sequence."""
    nouns = list(load_numseq_index('data/numseq-noun-index.json').items())

    def _find(s):
        if not s:
            return []
        # Ignore empty keys (nouns without mapped consonants); they would never consume a digit
        options = {v: n for n, w in nouns for v in w if n and s.startswith(n)}
        print('Remaining digits:', crayons.yellow(s, bold=True))
        print(', '.join(options.keys()))
        while True:
            user_noun = input('Choose next noun: ')
            if user_noun in options:
                break
        print()
        user_num = options[user_noun]
        remainder = s[len(user_num):]
        return [user_noun] + _find(remainder)

    return _find(numseq_to_str(numseq))
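
If you just want a first suggestion without answering prompts, the same idea can be automated. The following is a rough sketch of a non-interactive variant, not part of the function above: it tries the longest matching prefix first and backtracks when the remaining digits cannot be encoded. The function name and the miniature index are hypothetical, and there is no memoization, so it may be slow on long inputs with a large index.

```python
def find_noun_sequence(digits, noun_index):
    """Return a list of nouns whose digit strings concatenate to digits, or None."""
    if not digits:
        return []
    # Longest prefix first, so that fewer words are needed
    for length in range(len(digits), 0, -1):
        nouns = noun_index.get(digits[:length])
        if not nouns:
            continue
        rest = find_noun_sequence(digits[length:], noun_index)
        if rest is not None:
            # min() picks the alphabetically first noun to keep the result deterministic
            return [min(nouns)] + rest
    return None

# Hypothetical miniature noun index (digit string -> nouns)
toy_noun_index = {
    '21': {'wand'},
    '218': {'native'},   # dead end here: nothing encodes the remaining '4775142'
    '847': {'frog'},
    '75142': {'cauldron'},
}

print(find_noun_sequence('2184775142', toy_noun_index))  # ['wand', 'frog', 'cauldron']
print(find_noun_sequence('9', toy_noun_index))           # None
```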

Step 8: Build your personal word list

You don’t always have your computer at hand to find good encodings for you. Plus, picturing new encodings every time you need to memorize a number takes more mental effort and more time.

It’s better (and faster) to reuse the same encodings over and over again. In order to memorize number sequences efficiently, you should have a list of encodings for all numbers from 00 to 99.

For this purpose, I created a few functions that will help you build your personal word list and save it to a JSON file.

def load_wordlist(filename):
    if os.path.isfile(filename):
        with open(filename) as f:
            return json.load(f)
    return dict()

def save_wordlist(wordlist, filename):
    with open(filename, 'w') as f:
        json.dump(wordlist, f)

def interactive_create_wordlist(filename='data/wordlist.json'):
    wordlist = load_wordlist(filename)
    # All one- and two-digit number sequences: '0'..'9' and '00'..'99'.
    numbers = [numseq_to_str(w)
               for z in range(1, 3)
               for w in gen_all_numseqs_of_length(z)]
    missing = sorted(set(numbers) - set(wordlist.keys()))
    for num in missing:
        print('Number:', crayons.green(num, bold=True))
        print('Ideas:', ', '.join(find_nouns_for_numseq(num)))
        user_noun = input('Enter chosen noun: ')
        wordlist[num] = user_noun
        save_wordlist(wordlist, filename)
        print()
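
Once the word list exists, encoding a number is a matter of splitting it into groups of two digits and looking each group up. The following is a rough sketch of that lookup, not part of the original code: encode_with_wordlist and toy_wordlist are hypothetical names, and the four entries are only sample encodings taken from this post.

```python
def encode_with_wordlist(digits, wordlist, group_size=2):
    """Split `digits` into groups and look each group up in `wordlist`."""
    groups = [digits[i:i + group_size]
              for i in range(0, len(digits), group_size)]
    return [wordlist[g] for g in groups]


# Hypothetical excerpt of a personal word list (keys are digit strings).
toy_wordlist = {'0': 'ice', '00': 'sauce', '21': 'wand', '84': 'fear'}

print(encode_with_wordlist('21840', toy_wordlist))  # ['wand', 'fear', 'ice']
```

A trailing single digit is looked up on its own, which is one reason to keep one-digit entries in the list.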

Step 9: Memorize your word list with the help of the Leitner system

Memorizing over a hundred number-word pairs takes significant effort, but with the help of the Leitner system nothing shall stop you.

The Leitner system is an implementation of the good old principle of spaced repetition. Spaced repetition is a strategy for learning facts. The main idea is that you review facts less frequently the more often you remembered them correctly.

You may use real flashcards of course but here are a few functions that implement a simple Leitner system.

def load_leitner_data(filename):
    return load_json_file_or_dict(filename)

def save_leitner_data(data, filename):
    save_to_json_file(data, filename)

def leitner_add_fact(fact, filename='data/leitner.json'):
    data = load_leitner_data(filename)
    facts = data.setdefault('facts', [])
    for f in facts:
        if fact['front'] == f['front']:
            f['back'] = fact['back']
            f['box'] = 0
            break
    else:
        facts.append(dict(front=fact['front'], back=fact['back'], box=0))
    save_leitner_data(data, filename)

def leitner_train(box, filename='data/leitner.json'):
    print('Training box:', box, '\n')
    data = load_leitner_data(filename)
    facts = data.setdefault('facts', [])
    random.shuffle(facts)
    for f in facts:
        if f['box'] == box:
            print('Front:', f['front'])
            user_back = input('Back: ')
            if user_back == f['back']:
                print(crayons.green('Correct!'))
                f['box'] = max(f['box'] + 1, 2)
            else:
                print(crayons.red('Wrong!'), 'Correct answer is', crayons.yellow(f['back']))
                f['box'] = 1
            print()
            save_leitner_data(data, filename)
    print('Done')

def interactive_leitner(filename='data/leitner.json', max_box=5):
    data = load_leitner_data(filename)
    counts = {x: 0 for x in range(max_box+1)}
    for f in data.get('facts', []):
        # Facts beyond max_box count as learned and are not listed.
        if f['box'] in counts:
            counts[f['box']] += 1
    counts = sorted(counts.items(), key=operator.itemgetter(0))
    print('Leitner boxes:', ', '.join(f'{crayons.magenta(box, bold=True)}: {ct}'
                                      for box, ct in counts))
    while True:
        try:
            user_box = int(input('Choose box to train: '))
        except ValueError:
            continue
        if 0 <= user_box <= max_box:
            break
    print()
    leitner_train(user_box, filename)
    print()

Run interactive_leitner() to start training. It will show you the number of facts in each box and ask you which box you would like to review.

In this case, I used box 0 as a staging box for new cards. This way you can add hundreds of facts into the system at once without overwhelming your learning capacity.

When reviewing box 0, if you answer correctly the fact jumps to box 2, otherwise to box 1. Once you have introduced enough new facts (I suggest 5-10), press Ctrl-C to stop.

When reviewing the other boxes, if you answer correctly the fact jumps to the next-higher box, otherwise back to 1. Finish reviewing the box or press Ctrl-C to stop.

Once a fact moves past max_box, so to box 6 in this case, it is considered successfully learned and no longer comes up for review.
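
The box transitions described above can be isolated into a small pure function. This is a sketch for illustration; next_box and MAX_BOX are hypothetical names that do not appear in the code above, although the rule itself mirrors leitner_train.

```python
def next_box(box, correct):
    """Box a fact moves to after being reviewed in `box`."""
    if correct:
        # Box 0 is the staging box, so a correct answer skips straight to box 2.
        return max(box + 1, 2)
    return 1  # any wrong answer sends the fact back to box 1


MAX_BOX = 5
box = 0
history = [box]
while box <= MAX_BOX:
    box = next_box(box, correct=True)
    history.append(box)

print(history)  # [0, 2, 3, 4, 5, 6]
```

A fact that is always answered correctly therefore needs five reviews before it leaves the system.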

Review the boxes with decreasing frequency, e.g. box 1 daily, box 2 every other day, box 3 once a week, and so on.
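
If you don’t want to keep track of the schedule yourself, a few lines are enough to tell you which boxes are due. This is a sketch under assumptions: the intervals for boxes 1 to 3 follow the suggestion above, while the values for boxes 4 and 5 and the names REVIEW_INTERVAL_DAYS and boxes_due are made up for illustration.

```python
# Review interval in days per box. Boxes 1-3 follow the suggestion in the text;
# the intervals for boxes 4 and 5 are assumptions.
REVIEW_INTERVAL_DAYS = {1: 1, 2: 2, 3: 7, 4: 14, 5: 30}


def boxes_due(day):
    """Return the boxes to review on a given day, counting days from 1."""
    return [box for box, interval in REVIEW_INTERVAL_DAYS.items()
            if day % interval == 0]


for day in (1, 2, 7, 14):
    print(day, boxes_due(day))
# 1 [1]
# 2 [1, 2]
# 7 [1, 3]
# 14 [1, 2, 3, 4]
```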

Adding the word list created in the previous step to our Leitner system is straightforward.

def add_wordlist_to_leitner(filename='data/wordlist.json'):
    wordlist = load_json_file_or_dict(filename)
    for num, word in wordlist.items():
        leitner_add_fact(dict(front=num, back=word))

add_wordlist_to_leitner()

Continue practicing until all facts are beyond box 5.

Step 10: Practice memorizing number sequences

You are now fully ready to use the major system to memorize number sequences. Consider the following training program. It shows you a number to memorize, then distracts you with a mental math exercise before it asks you to enter the memorized number.

I used curses for some advanced user interaction in the terminal.

import curses

def group_numseq_str(numseq_str, group_size=2):
    """12345 -> 12 34 5"""
    return ' '.join(''.join(x)
                    for x in itertools.zip_longest(fillvalue='',
                                                   *[iter(numseq_str)] * group_size))

def curses_input(stdscr, prompt):
    curses.echo()
    stdscr.addstr(prompt)
    stdscr.refresh()
    user_input = []
    while True:
        key = stdscr.getkey()
        if key == '\n':
            break
        user_input.append(key)
    curses.noecho()
    return ''.join(user_input)

def mental_exercise(stdscr):
    curses.echo()
    a, b = (random.randint(0, 1000) for _ in range(2))
    while True:
        try:
            stdscr.clear()
            answer = int(curses_input(stdscr, f'{a} + {b} = '))
        except ValueError:
            continue
        if answer == a + b:
            break
    curses.noecho()

def interactive_memorize_numseq(numseq):
    numseq_str = numseq_to_str(numseq)
    numseq_str_grouped = group_numseq_str(numseq_str)

    def _func(stdscr):
        curses.init_pair(1, curses.COLOR_BLUE, curses.COLOR_BLACK)
        curses.init_pair(2, curses.COLOR_GREEN, curses.COLOR_BLACK)
        curses.init_pair(3, curses.COLOR_RED, curses.COLOR_BLACK)
        while True:
            stdscr.clear()
            stdscr.addstr('Number sequence: ')
            stdscr.addstr(numseq_str_grouped, curses.color_pair(1) | curses.A_BOLD)
            stdscr.refresh()

            stdscr.getkey()
            mental_exercise(stdscr)
            stdscr.clear()
            user_numseq_str = ''.join(x for x in curses_input(stdscr, 'Enter number: ')
                                      if x.isdigit())
            if user_numseq_str == numseq_str:
                stdscr.clear()
                stdscr.addstr('Correct!', curses.color_pair(2))
                stdscr.getkey()
                break
            else:
                stdscr.clear()
                stdscr.addstr('Wrong! Try again.', curses.color_pair(3))
                stdscr.getkey()

    curses.wrapper(_func)

def train_memorizing(min_numseq_len=4, max_numseq_len=8):
    while True:
        numseq_len = random.randint(min_numseq_len, max_numseq_len)
        numseq = [random.randint(0, 9) for _ in range(numseq_len)]
        interactive_memorize_numseq(numseq)
        
train_memorizing()
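
The trainer shows the number in groups of two digits but compares only the digits you type, so you may enter the answer with or without spaces. The following standalone sketch illustrates that round trip outside of curses; group_digits and normalize_answer are hypothetical names that mirror the grouping and filtering logic above.

```python
import itertools


def group_digits(digits, group_size=2):
    """'12345' -> '12 34 5'"""
    # The same iterator is repeated group_size times, so zip_longest
    # consumes the digits in consecutive chunks.
    chunks = itertools.zip_longest(*[iter(digits)] * group_size, fillvalue='')
    return ' '.join(''.join(chunk) for chunk in chunks)


def normalize_answer(answer):
    """Keep only the digits, so '12 34 5' and '12345' compare equal."""
    return ''.join(ch for ch in answer if ch.isdigit())


shown = group_digits('12345')
print(shown)                               # 12 34 5
print(normalize_answer(shown) == '12345')  # True
```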

Step 11: Apply the major system to real life situations

While the mnemonic major system seems tedious at first, with enough practice it becomes an incredibly efficient technique for memorizing long (or short) number sequences quickly.

In everyday life, there are many opportunities for memorizing number sequences. For example, the next time you’re eating out, memorize the price of each dish you order and impress everyone at the end when it’s time to split the check and no one knows what they owe.

Add numbers you want to memorize long-term to the Leitner system created in Step 9. For example, to add important emergency phone numbers in case you lose your phone:

leitner_add_fact(dict(front='phone: snape', back='+44 20 7946 0724'))
leitner_add_fact(dict(front='phone: dark lord', back='+40 711 5554 820'))

Conclusion

In this post, I introduced the mnemonic major system and showed you how to decode text to number sequences manually and automatically, find encodings in the content of your favorite fantasy books, build your personal word list, use it to memorize number sequences of arbitrary length, and write your own code to practice all of these concepts.

The code for this post can be found on GitHub.

I hope you found the information in this post useful. In future posts, I will explore other ways to apply data science to your favorite fantasy literature and maybe have a look at other memory techniques as well.

Share your feedback in the comments and most importantly, start memorizing!

Update 1: Both sounds ð and θ should decode to 1

As suggested in a comment on Hacker News, the voiced and unvoiced pair ð, θ belong together and should decode to 1.

Originally, I put θ into the group of sounds that decode to 8 because, to me, the sound feels intuitively closer to f than to d.

However, many words can be pronounced either with ð or with θ, which means that separating them increases the number of words that require a manual IPA disambiguation in Step 3. Thus, I am convinced that it is better to have them both decode to 1.
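
To make the updated mapping concrete, here is a minimal decoder sketch based on the table in Step 1, with ð and θ both in the group for 1. The names SOUND_TO_DIGIT and decode_ipa are hypothetical and the decoder used in the earlier steps may well differ in its details. The sketch assumes one character per sound and accepts both spellings of g.

```python
# Consonant sounds per digit, following the table in Step 1.
# Index 1 contains both ð and θ. Both 'g' and 'ɡ' are listed because
# IPA strings may use either character.
SOUNDS_BY_DIGIT = ['sz', 'tdðθ', 'nŋ', 'm', 'r', 'l', 'ʤʧʃʒ', 'kgɡ', 'fv', 'pb']

SOUND_TO_DIGIT = {sound: str(digit)
                  for digit, sounds in enumerate(SOUNDS_BY_DIGIT)
                  for sound in sounds}


def decode_ipa(ipa):
    """Decode an IPA string to digits, ignoring all unmapped sounds."""
    return ''.join(SOUND_TO_DIGIT[ch] for ch in ipa if ch in SOUND_TO_DIGIT)


print(decode_ipa('ækʃən'))  # 762  (action)
print(decode_ipa('iðər'))   # 14   (either, voiced th)
print(decode_ipa('θərst'))  # 1401 (thirst, unvoiced th)
```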

Update 2: Word list based on Characters

Based on NF’s suggestion in the comments, I’m going to create a list of two-digit pegs using characters from the Harry Potter books. Instead of decoding the entire name, however, only the first two decoded digits are used; otherwise we couldn’t find encodings for all number sequences of length 2.

The character names are quickly extracted with the help of a technique called Named-entity recognition (NER). NER identifies so-called entities and places them into categories such as PERSON, DATE, GPE (which stands for geopolitical entity), …

In this case, using spaCy instead of NLTK is simpler and faster. See spaCy’s docs for a full list of named entity types.

import spacy

# You need to run "python -m spacy download en_core_web_sm" to get the English model
# (the old 'en' shortcut no longer works in spaCy 3 and later)
nlp = spacy.load('en_core_web_sm')

def extract_persons(text):
    """Extract all persons from text that occur more than once."""
    persons = list()
    for sent in nltk.tokenize.sent_tokenize(remove_double_quotation_marks(text)):
        doc = nlp(sent)
        for ent in doc.ents:
            if ent.label_ == 'PERSON':
                persons.append(ent.text)
    return [x for x, ct in collections.Counter(persons).most_common() if ct > 1]


persons = extract_persons(harry_potter_text())
with open('data/harry-potter-persons.txt', 'w') as f:
    f.write('\n'.join(persons))

This will output the identified characters into a text file. Unfortunately, for most of the names we don’t have IPAs so you will have to build your word list manually.

You can still reuse the interactive_create_wordlist function from before: just use a different filename and ignore the presented nouns. Instead, look for suitable names in the file with the character names we just created.

> interactive_create_wordlist(filename='data/character-pegs.json')
Number: 0
Ideas: haze, hiss, easy, c, house, oohs, wiz, sigh, ze, soo, eyes, see, s, yes, say, wise, ease, zoo, icy, saw, woes, use, z, ice, ways, sea, essay, hazy, c'est, so, sa, wheezy
Enter chosen noun: ice

Number: 00
Ideas: whizzes, sees, houses, hisses, saucy, cease, seize, uses, cissy, sighs, essays, size, saws, seas, sauce, wheezes
Enter chosen noun: sauce

Number: 01
Ideas: hestia, suit, sorta, side, waste, swayed, set, stew, sweat, soda, sweet, east, sat, hoist, site, soot, sight, waist, seed, city, seat, eyesight, aside, sad, stay, sweaty, asset, west, host, acid, seaweed, haste, sit, sweetie
Enter chosen noun: Sturgis

Number: 02
Ideas: sane, whizzing, scene, housing, swing, zone, sewn, hissing, sun, sonny, hassan, snow, swaying, sung, icing, seeing, sin, son, sang, snowy, sign, sunny, song, wheezing, swan, zan, swung, swine
Enter chosen noun: Snape

Honestly, I didn’t find a good name for every number sequence, so I sometimes used nouns instead. Locations (GPE and LOC) could be extracted and used for the word list as well.
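
The rule that only the first two decoded digits of a name count can be sketched in a few lines. Note the assumptions: peg_key and decoded_names are hypothetical names, and the digit strings were decoded by hand from an assumed pronunciation of each name (Snape as s, n, p and Sturgis as s, t, r, ʤ, s).

```python
def peg_key(decoded_digits):
    """Only the first two decoded digits of a name identify its peg."""
    return decoded_digits[:2]


# Hand-decoded digit strings (assumptions, see above).
decoded_names = {'Snape': '029', 'Sturgis': '01460'}

# Group the candidate names by the two-digit peg they can stand for.
pegs = {}
for name, digits in decoded_names.items():
    pegs.setdefault(peg_key(digits), []).append(name)

print(pegs)  # {'02': ['Snape'], '01': ['Sturgis']}
```

Grouping the names this way makes it easy to see which two-digit numbers still lack a character.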