Cat or Dog? Classifying pets by name

by Richard in 
deep-learning nlp predictive-models python

Classifying images into cats and dogs is a very popular application of deep learning, but what about their names? Can you tell from the name of a pet whether it is a cat or a dog? To some extent, yes. For example, a pet named “Mr. Tiddles” or “Garfield” is very likely to be a cat, whereas “Rex”, “Rover” or “Fido” is probably a dog.

To what extent are cat and dog names different? It it possible to determine whether a pet is a cat or dog from the name alone? How accurately? Are there some telltale signs which make some names cattier or doggier than others?

I happen to have a classifier for a similar problem lying around that I want to share because it was so successful. However, I cannot give details of the original application since it came from a commerical setting. This led me to the cat and dog problem.

Step 1: Thought Experiment

I like to get started on a problem like this by thinking about how a human would approach it. How do I know, upon hearing a name like Garfield, whether it belongs to a dog or a cat? The obvious answer is: I know it is probably a cat because I have encountered a cat of that name before.

I guess that the vast majority of cases fall into this category. So, all I need is a database of names of cats and dogs, and I can look up the name of an unseen pet in the database and decide whether it is a cat or dog by counting how many of each type of pet had that name.

This simple look-up approach can be implemented as a naive Bayes classifier, which is a standard kind of classifier. It has the advantage of being extremely simple.

One obvious disadvantage of this approach is how to cope with a name which is not in the database? This problem will presumably go away if you have enough data (if you have the name of every pet in the world, for example) but it is nevertheless true that many people like to give their pets unique and unusual names which have to be dealt with.

A second possible approach is the linguistic approach. Here, you would analyse the name and extract features. For example, I imagine that names beginning with R are more likely to be associated with dogs, because they have a growling quality. This sort of thing cannot possibly be detected by looking names up in a list. However, nowadays it is quite straightforward to build NLP (natural language processing) models to approach problems like this, using tools which have been developed within the last decade or so.

Step 2: Gather Data

Searching for pet name data gives a lot of top ten lists, but not many lists of all pets in a given location. The key search term here is open data. Seattle has made a collection of pet licenses available via its open data portal here. I didn’t feel like one city would be enough, so I also got some open data from Toronto here.

Some processing is required because the Toronto data is a frequency list rather than individual pets. Also, it seems that pets in Toronto are registered each year, since some rare names re-occur on a regular basis. For this reason, I only took the most recent occurence of each name in the Toronto data set.

Finally, I combined the data sets and uploaded the data in a single csv file here. Unfortunately, the data set is too large to preview, but we’ll see what it looks like in a minute.

Step 3: Modelling

I imported the data using Pandas. I chose to prepare a modelling dataset using only the Seattle data.

pets = pd.read_csv("https://raw.githubusercontent.com/rtrvale/datasets/master/pets_seattle_toronto.csv")
pets.head()
Unnamed: 0	name	species	year	city	count
0	0	NaN	dog	2000	Seattle	1.0
1	1	FANCY	dog	2000	Seattle	1.0
2	2	SKIP	dog	2000	Seattle	1.0
3	3	KANGA	dog	2000	Seattle	1.0
4	4	OSCAR	dog	2000	Seattle	1.0

# some cleaning required (there are missing names but no missing species)
seattle_species = seattle_species[pd.notna(seattle_names)]
seattle_names = seattle_names[pd.notna(seattle_names)]

seattle_species = seattle_species.reset_index()['species']
seattle_names = seattle_names.reset_index()['name']

RNN

The text classification RNN is the less straightforward of the two models, so I started with that one, following the Tensorflow tutorials.

The first step is to convert each pet name into an array of numbers, so that it can be fed into a neural network. This could be done by mapping each character to its ascii code, but Tensorflow has a slightly more sophisticated encoder. I am not entirely sure how it works.

# build an encoder
import tensorflow_datasets as tfds
encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    seattle_names, target_vocab_size=1000)

# example
encoder.encode('REX')
[24, 840]

A name is encoded as a list of integers, which in general will be shorter than the name itself. The input data is created by encoding all of the names in seattle_names.

input_data = []
for i in range(len(seattle_names)):
  input_data += [encoder.encode(seattle_names[i])]

The encoded names are different lengths. However, they must all be the same length in order to be fed into a neural network. Therefore, they are padded to the length of the longest encoded name (in this case, 22).

# padding the input data

np.max([len(x) for x in input_data])
# 22

for x in input_data:
  if len(x) < 22:
    x += [0]*(22 - len(x))
    
# convert to numpy array
input_data = np.array(input_data).reshape((len(input_data), 22))

The y-variable will have a 1 if the animal is a cat. I chose to downsample in order to have a balanced data set

y = (seattle_species == "cat")

dog_sample = np.random.choice(np.where(~y)[0], 15824)
cat_sample = np.where(y)[0]

yb = pd.concat([y[dog_sample], y[cat_sample]])
y_raw = yb

X = input_data[yb.index]
X_raw = [encoder.decode(X[i]) for i in range(X.shape[0])]

# (y_raw and X_raw are to be kept for later)

yb = tf.keras.utils.to_categorical(yb)
yb = yb.reshape((len(yb), 2))

Now everything is set up to be fed into a network.

X.shape
(31648, 22)

yb.shape
(31648, 2)

The data is shuffled so that cats and dogs appear in a random order.

shuff = np.random.permutation(np.arange(yb.shape[0]))

X = X[shuff, :]
yb = yb[shuff, :]

I took the network directly from the tutorial example, except for adding a dropout layer.

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64, input_length=22),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100)),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])

model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])

The model achieves roughly 60% accuracy.

model.fit(X, yb, epochs=5, validation_split = 0.1)

Epoch 1/5
891/891 [==============================] - 8s 9ms/step - loss: 0.7226 - accuracy: 0.5262 - val_loss: 0.7164 - val_accuracy: 0.5504
Epoch 2/5
891/891 [==============================] - 7s 8ms/step - loss: 0.7053 - accuracy: 0.5835 - val_loss: 0.7021 - val_accuracy: 0.5924
Epoch 3/5
891/891 [==============================] - 7s 8ms/step - loss: 0.6908 - accuracy: 0.6165 - val_loss: 0.6995 - val_accuracy: 0.6000
Epoch 4/5
891/891 [==============================] - 7s 8ms/step - loss: 0.6853 - accuracy: 0.6290 - val_loss: 0.7000 - val_accuracy: 0.5940
Epoch 5/5
891/891 [==============================] - 7s 8ms/step - loss: 0.6832 - accuracy: 0.6349 - val_loss: 0.7010 - val_accuracy: 0.5937

To make a prediction, you have to encode the name of the pet, pad to length 22 (I’m not sure what happens if the encoded name is longer than 22 characters though?) and then run it through the trained network.

# how to make a prediction
enc = encoder.encode("FIDO")
enc = enc + [0]*(22-len(enc))
model.predict(np.array(enc).reshape((1,22)))

array([[0.08036249, 0.9196375 ]], dtype=float32)

Remember, [1, 0] is a cat, so this means that FIDO is very likely to be a dog.

Naive Bayes

The Naive Bayes model can be encoded from scratch just by writing a few loops. A crude way to deal with unseen names is to use the Levenshtein distance (edit distance) between strings. The classifier is built by setting up a dictionary of all word occurring in all known cat and dog names in Seattle. A new name is classified by splitting it into words, and then looking for the known words with the closest Levenshtein distance, and counting how many of them are cats and dogs.

(Notice that we have to use words instead of names, because some names have multiple words.)

from Levenshtein import distance

def nb_classifier(strings, cat):
  # set up dictionary to contain counts
  nb_dict = {}

  for i in range(len(strings)):
    # split string into tokens
    words = strings[i].split(" ")
  
    for word in words:
      word = word.strip()
      # if word does not occur, add it
      if word not in nb_dict:
        nb_dict[word] = [0, 0]
      # cat[i] = 0 if ith name is a dog, else 1
      # nb_dict[word] = [cat count, dog count], a list of length 2
      nb_dict[word][cat[i]] += 1
      
  return nb_dict
def classify(name, nb_dict):
  words = name.split(" ")

  # initialize outputs
  cat_prob = 1
  dog_prob = 1

  # get total numbers of cats and dogs
  total_cats = np.array([nb_dict[k][1] for k in nb_dict.keys()]).sum()
  total_dogs = np.array([nb_dict[k][0] for k in nb_dict.keys()]).sum()

  for word in words:
    
    cats = 0
    dogs = 0
    # convert word to upper case with no spaces
    word = word.strip().upper()
    # keep a record of which words are the closest
    min_dist = distance(word, list(nb_dict.keys())[0])
    for k in nb_dict.keys():
      dist = distance(word, k)
      # if k was closer than current closest word, use k instead
      if dist < min_dist:
        cats = nb_dict[k][1]
        dogs = nb_dict[k][0]
        min_dist = dist
        # if exact match, no need to search further
        if dist == 0:
          break
      # if k was as close as the current closest word, add the
      # counts of cats and dogs from word k to current totals
      elif dist == min_dist:
        cats += nb_dict[k][1]
        dogs += nb_dict[k][0]
    # calculate naive Bayes probabilities by multiplying
    cat_prob *= (cats + 1)/(total_cats + 1)
    dog_prob *= (dogs + 1)/(total_dogs + 1)
  return (cat_prob, dog_prob)

Comparison

In order to compare the two models, I used a single training and testing split.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_raw, y_raw, test_size=0.1)

Accuracy for naive Bayes.

nb1 = nb_classifier(list(X_train), list(y_train))

pred = []
for i in range(len(list(X_test))):
  pred += [classify(list(X_test)[i], nb1)]
  
predClass = pd.Series([x[0] > x[1] for x in pred])
predClass.index = y_test.index # re-indexing is necessary for pandas
tab = pd.crosstab(predClass, pd.Series(y_test))
(tab[0][0] + tab[1][1])/tab.sum().sum()
0.6189573459715639
tab
species	False	True
row_0		
False	1119	726
True	480	840

To have a fair comparsion, we need to do the whole process of training the RNN on the training data alone. This includes building the encoder! So the RNN code has to be written in a separate function.

def rnn_classifier(strings, cat):
  encoder = tfds.features.text.SubwordTextEncoder.build_from_corpus(
    strings, target_vocab_size=1000)
  
  # encode the strings
  encoded = [encoder.encode(x) for x in strings]
  max_encode_length = np.max([len(x) for x in encoded])
  for i in range(len(encoded)):
    encoded[i] += [0]*(max_encode_length - len(encoded[i]))

  X = np.array(encoded).reshape((len(encoded), max_encode_length))
  y = tf.keras.utils.to_categorical(cat)
  y = y.reshape((len(cat), 2))
 
  model = tf.keras.Sequential([
    tf.keras.layers.Embedding(encoder.vocab_size, 64, input_length=max_encode_length),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100)),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
  ])

  model.compile(loss=tf.keras.losses.BinaryCrossentropy(from_logits=True),
              optimizer=tf.keras.optimizers.Adam(1e-4),
              metrics=['accuracy'])
  
  model.fit(X, y, epochs=3)
  return {'model':model, 'encoder':encoder, 'M':max_encode_length}

def classify_rnn(name, rnn):
  encoded = rnn['encoder'].encode(name)
  encoded += [0]*(rnn['M'] - len(encoded))
  pred = rnn['model'].predict(np.array(encoded).reshape((1, len(encoded))))
  return pred[0][1]

Now we can apply it to the training data

rnn = rnn_classifier(X_train, y_train)
Epoch 1/3
891/891 [==============================] - 7s 8ms/step - loss: 0.7235 - accuracy: 0.5171
Epoch 2/3
891/891 [==============================] - 7s 7ms/step - loss: 0.7103 - accuracy: 0.5738
Epoch 3/3
891/891 [==============================] - 7s 7ms/step - loss: 0.6918 - accuracy: 0.6154

(You can try training it for longer, but I don’t think it improves the out-of-sample performance much.)

# examples
classify('REX', nb1)
(0.0004480860325182435, 0.0011731734341393469)
# The interpretation is: dog (since the second number is bigger)

classify_rnn('REX', rnn)
0.0090955645
# This is the score for 'REX' being a cat; very small, as expected

Accuracy and cross-table:

pred = []
for i in range(len(X_test)):
  pred += [classify_rnn(list(X_test)[i], rnn)]

predClass = pd.Series([x > 0.5 for x in pred])
predClass.index = y_test.index
tab = pd.crosstab(predClass, pd.Series(y_test))
(tab[0][0] + tab[1][1])/tab.sum().sum()
0.608214849921011

tab
species	False	True
row_0		
False	1034	675
True	565	891

Note that the two models performed very similarly. I am not surprised. While I expect that there are some interesting insights which could be gained from the RNN, which would be invisible to naive Bayes, I think that the classification performance is likely to be better on longer texts. I do think the RNN is the more elegant of the two solutions.

Step 4: Implementation

Summary of the pros and cons of the two methods.

Naive Bayes

Pros
  • Easy to understand
  • Requires no packages (except for Levenshtein distance; straightforward to implement)
  • Can cope with non-English names (example: MICIO is correctly identified as a cat) provided that they occur in the data. This can be a huge advantage for some applications
  • It’s extremely fast! I had to add in an artificial delay to make it look as if the quote-unquote AI was quote-unquote thinking
Cons
  • Requires a large amount of data
  • Can be fooled by things like spelling errors

RNN

Pros
  • Plenty of room for experimentation
  • Likely to be a more meaningful model
Cons
  • More difficult to implement
  • Cannot easily be ported (for example, for embedding in this webpage, or rewriting in a different programming language)

Conclusion

Given unlimited time and resources, I expect that the RNN approach would outperform naive Bayes. But in reality, there are often constraints, and for this reason I think that naive Bayes wins every time.

I like the model so much that I embedded it in this page, using an implementation of Levenshtein distance in Javascript written by Ramesh Nair.

The Javascript code is in a Github repo. I also included a Jupyter Python notebook there, with the code from this post. There is also a standalone version of the page to play with without the explanation and code (but with working SVG animations! For some reason, I cannot get the SVG files to work well with the Jekyll backend used in Github blogs).

Cat or Dog?

Type your pet's name in the box, and the AI will calculate whether it is more likely to be a cat or a dog!
© Data Science Confidential 2020 All rights reserved.