The story so far: This blog contains the laboratory notes of a research program to use data-science to invent a completely new kind of monster. We have had some early success. But the research is just getting started. Our initial research thrust is the acquisition and analysis of thousands of Wikipedia pages related to monsters.
The previous post described how our “one hop scrape” found 7078 potentially useful Wikipedia articles about monsters. Based on a quick manual check, we found that many of these 7078 pages included articles about monsters that were not in our original seed set of articles. In other words, the Wikipedia monster-collection system works. Sort-of. Plenty of the 7078 articles we harvested were not about monsters.
Here’s a picture showing where we are in the gathering data part of this research thrust:
If we can find ways of separating the articles about monsters from articles on other subjects, we can use a bootstrapping process to collect a greater and greater quantity of monster-data from Wikipedia. But before we really dive into the process of building a classifier to separate out just the monster articles from our Wikipedia data set, we should conduct some initial reconnaissance of the data we gathered. What are these 7078 articles about?
To get a gist of what this pile of documents is about, I’ve used one of my favorite text-processing methods: Statistical Topic Modeling. In other projects, I’ve used STM to study collections of Science Fiction stories and compare the ideas they contained with US Department of Defense research projects (David), to understand how groups in conflict differ in their attitudes towards current events (Hawes), and to map emotions to topics of discussion in online forums (Carlson).
The Statistical Topic Modeling algorithm uses statistical analysis to find groups of words that tend to show up together in the same documents. These collections of words are called topics. Every document in the data set (each of our 7078 Wikipedia articles in this case) is modeled as a mixture of those topics. In summary:
Statistics = counting words and keeping track of which documents they show up in and which words show up together.
Topics = groups of words that show up together.
Modeling = computing the mixture of topics for each document.
Here’s a picture that shows how topics and documents are related:
In this picture, we have three documents. The first document is focused entirely on magical weapons, so words like magic and sword show up a lot. The second document talks about Mizuchi (you know, mythical Japanese water dragons!). We see a mixture of topics here: swords, dragons, and water. The last document talks about another kind of mythical water dragon: the Balaur. This document doesn’t talk about the weapons you need to kill a Balaur, so we don’t see any significant contributions from the magical weapons topic.
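The mixture idea in this picture can be sketched in a few lines of Python. Everything here (topic names, word weights, the crude scoring rule) is invented for illustration; a real topic model learns its topics and mixtures from the data rather than having them written down by hand.

```python
# Toy illustration of "document = mixture of topics" (made-up weights,
# not a real topic-model fit). Each topic is a weighted set of words;
# a document's mixture is estimated by crediting each of its words to
# the topics that weight that word.
topics = {
    "magical_weapons": {"magic": 0.5, "sword": 0.4, "enchanted": 0.1},
    "dragons":         {"dragon": 0.6, "serpent": 0.2, "scales": 0.2},
    "water":           {"river": 0.4, "water": 0.4, "flood": 0.2},
}

def topic_mixture(doc_words):
    """Crude mixture estimate: sum each topic's weight over the document's
    words, then normalize so the mixture sums to 1."""
    scores = {name: sum(weights.get(w, 0.0) for w in doc_words)
              for name, weights in topics.items()}
    total = sum(scores.values()) or 1.0
    return {name: s / total for name, s in scores.items()}

# A Mizuchi-like document mixes dragons, water, and a little weaponry:
mizuchi = ["dragon", "water", "river", "sword"]
mix = topic_mixture(mizuchi)
```

Run on the Mizuchi-like word list, the mixture is heaviest on the water topic, with smaller contributions from dragons and magical weapons – the same pattern as the second document in the picture.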
From this simple picture (which I just made up – it’s not an actual topic modeling result), it may look like I’ve solved the problem of ever needing to actually read anything again. Topic modeling tells you what everything is about!
Unfortunately, not really.
Topic modeling makes models. And a model isn’t the real thing. It’s an approximation with varying levels of fidelity. Derek Zoolander’s thoughts on models come to mind:
What is this, a center for ants?
How can we be expected to teach children to learn how to read if they can’t even fit inside the building?
Like the miniature architectural model that enraged Mr. Zoolander, a topic model of a collection of documents is like a collection of documents for ants. A topic model is a group of topics, where each topic is a group of words. The words don’t form sentences, or even necessarily appear in any kind of order designed to make sense to a human, because they were created to fit a probability distribution. But they do usually manage to convey the gist of a huge bunch of documents, just like the architectural model conveys the sense of what the real building will look like.
I chose to use 250 topics to model the one-hop-scrape documents. This means that I pointed a topic modeling algorithm at my collection of documents, specified “250” as the number-of-topics parameter, and set it running. (There’s a little more to it than that: if you want more details, or if you want to try it yourself, you can see the Jupyter Notebook I used to create my topic model here).
The output of this multi-hour execution of the Jupyter Notebook is a topic model with 250 topics. Essentially, it’s a table with 250 rows, where each row contains a weighted set of words. It can be a little overwhelming to look at a 250-topic model, but you can if you want to. A visualization of the highest-weighted words in each of the 250 topics in the model is here. And the full model is also in the git repo with the Jupyter Notebook.
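The “highest-weighted words” view of a topic row is just a sort. Here is a minimal sketch, with invented weights standing in for one row of the real 250-row table:

```python
# A topic is a weighted set of words; the visualization keeps only the
# highest-weighted ones. These words and weights are invented examples,
# loosely modeled on the Greek-mythology topic discussed below.
topic_row = {"greek": 0.12, "zeus": 0.09, "mythology": 0.08,
             "god": 0.05, "titan": 0.03, "hera": 0.02}

def top_words(topic, n=3):
    """Return the n highest-weighted words in a topic, heaviest first."""
    return [word for word, _ in
            sorted(topic.items(), key=lambda kv: kv[1], reverse=True)[:n]]

heaviest = top_words(topic_row)
```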
Here is a more user-friendly introduction to the one-hop-scrape model:
First, let’s start with some of the topics that seem to make sense. These are topics where we can look at the words as humans and get the gist of what they represent. Below I’ve hand-picked five topics that seem to cover coherent themes:
Here we see the topic number (a unique identifier for the topic) and the ten highest-weighted words in each topic. I’ve color-coded the background of each cell so that darker cells indicate words with higher weights.
Take a look at topic 101: greek, zeus, mythology, and so on. Those are words from Greek mythology. Perhaps this topic indicates a recurring theme of Greek mythology in our data? Topic 105 has, as its highest-weighted words, comics, marvel, and hulk. Those are words related to the Marvel universe. Perhaps a bunch of our 7078 Wikipedia articles discuss Marvel comics and movies.
In the case of these intuitive guesses about what these topics tell us about the data, we’re right. Here are a few of the highest-weighted Wikipedia articles for each of these five topics. (Remember, all documents are mixtures of topics, so no document is going to be entirely “about” the concepts found in a single topic).
We can see that our intuitions (about these five topics, anyway) are somewhat justified. For example, documents about topic 101 tend to be documents about Greek mythology, and documents about topic 105 tend to be about entities in the Marvel universe.
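Finding the highest-weighted articles for a topic amounts to sorting one column of the document-topic matrix. A sketch, with made-up article names and weights (a real model assigns these during fitting):

```python
# Rows of a document-topic matrix: each article's weight on each topic.
# These numbers are invented for illustration.
doc_topics = {
    "Zeus":    [0.70, 0.05, 0.25],   # mostly topic 0
    "Hulk":    [0.02, 0.88, 0.10],   # mostly topic 1
    "Mizuchi": [0.30, 0.01, 0.69],   # mostly topic 2, some topic 0
}

def top_docs_for_topic(doc_topics, k, n=2):
    """Articles ranked by their weight on topic k, heaviest first."""
    return sorted(doc_topics, key=lambda doc: doc_topics[doc][k],
                  reverse=True)[:n]
```

Note that every row sums to 1, and no row is all-or-nothing – exactly the “no document is entirely about one topic” point above.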
Now let’s look at some topics that don’t spark my intuition. Below I’ve hand-picked six of the 250 topics whose words, to me, look like they were chosen more-or-less at random.
Topic 37 strangely mixes words like blob, methane, shark, and devil. Topic 136 is, to me, even more random. Arkansas, liver, and east kilbride? What is this telling us about our data set? Using the same approach as with the five topics that seem to make sense, let’s find Wikipedia articles weighted towards these topics and see what they’re about:
The documents we find related to topic 37 involve (among a few other things) various movies about the blob monster, and various movies about sharks or water monsters. The blob and mega-shark are distinct monsters, but they were collected together into this topic. Why?
One explanation is that 250 topics were not enough to separate the distinct creatures in this data set. Without the freedom to separate documents about truly different subjects, some things that we humans like to think of as separate ideas get grouped together in a single topic. Maybe a 500-topic model would produce separate topics for “blob” and blob-monster-related words on one hand, and “mega-shark” and the words that tend to show up in discussions of mega-shark on the other.
Topic 136 is even more eclectic. An examination of some of the Wikipedia pages related to this topic, Arkansas, East Kilbride, and Liver, does not reveal (to me, anyway) any underlying connection that I can make sense of. But the algorithm doesn’t care! It found a group of co-occurring words that improves the model’s explanation of the data.
So, is the topic model any good?
It depends! What do you want to do with it? If we wanted to use this topic model to organize our library of 7078 articles, then it might not give us good results. Oh, an article about the Liver? That’s filed under Arkansas-Liver-East-Kilbride. You want to read about methane? Sure – go look in the section of the library named Blob-Methane-Shark-Film. Pretty weird, right?
My goal was to build a topic model of the data set to get a “gist” of what’s in it. I claim it does that. I’ve perused each of the 250 topics in the model, and I feel like I have a sense for the type of data the one-hop-scrape collected. Yes – feeling like I have a gist of the content of my data is a very subjective metric. But this is my project, and, for this part of it, that’s the metric that I’m going with. Success!
There are other games we can play with the topic model. My ultimate goal is to invent completely new kinds of monsters. Let’s see if we can use the topic model to look for … unnatural pairings.
In the above introduction to topic modeling, I’ve portrayed topics as smallish groups of words. That was oversimplifying a bit. Each of the 250 topics in this model is actually a probability distribution over all 25,583 words in the data set that appear more than 50 times. (There were 405,820 unique words in the data. But, of those, only 25,583 appeared more than 50 times. My topic modeling process used “only” those 25,583 frequently-occurring words.) Each topic is thus a 25,583-long vector of weights. Even though these vectors are long, we can still use ordinary distance metrics, like Euclidean distance (a higher-dimensional version of the Pythagorean theorem) or cosine distance, to measure how different one topic is from another.
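Cosine distance needs nothing but a dot product and two vector norms, so it works the same on a 25,583-long topic vector as on a toy one. A minimal sketch over invented four-word vocabularies:

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity. For non-negative topic-weight vectors
    this ranges from 0 (same direction) to 1 (no shared words)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

# Two tiny "topics" over a 4-word vocabulary. A topic compared with
# itself is distance 0; topics with no overlapping words are distance 1.
dragons = [0.6, 0.4, 0.0, 0.0]
aliens  = [0.0, 0.0, 0.7, 0.3]
```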
Using a distance metric to compare topics lets us answer questions such as:
Which topic is most similar to topic X?
Which topic is most different from topic X?
In fact, we don’t need to ask questions like this about specific topics, one at a time. We can measure the distance between every pair of topics, and visualize the resulting matrix using a heat map. Here it is:
This table has one row for each topic and one column for each topic. Each cell represents the similarity (1-distance) between the row and column’s topic. Small numbers, shown in red, represent dissimilar topics. Large numbers (1 is the largest number here) are shown in yellow and green, and represent very similar topics. The table is sorted so that topics that are dissimilar, overall, from everything else are in the upper left.
The green diagonal is a good sign that I didn’t make a math mistake when I produced the table – The diagonal cells all have value “1”, which means that each topic is maximally similar to itself.
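That diagonal sanity check is easy to reproduce: build the full similarity matrix and confirm every topic is maximally similar to itself. A sketch using cosine similarity over three toy topic vectors (the real matrix is 250×250):

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity of two weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.sqrt(sum(a * a for a in u)) *
                        math.sqrt(sum(b * b for b in v)))

def similarity_matrix(topic_vectors):
    """Pairwise similarity (1 - distance) between every pair of topics."""
    return [[1.0 - cosine_distance(u, v) for v in topic_vectors]
            for u in topic_vectors]

# Three invented topic vectors: the diagonal should come out as all 1s,
# and the matrix should be symmetric.
sim = similarity_matrix([[0.6, 0.4, 0.0],
                         [0.0, 0.1, 0.9],
                         [0.5, 0.5, 0.0]])
```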
My ultimate goal with this research project is to create something totally new. One way to create new things is to combine existing things that usually don’t occur together. For the domain of monsters, this table lets us do that! The deep red cells are pairs of topics that are most dissimilar. These dissimilar pairs of topics could be fertile ground for new ideas.
Below are the five pairs of topics that are most dissimilar (also known as least-similar):
I am calling these dissimilar topics “unnatural pairs” since, to my entertainment-consuming sensibilities, they contain ideas we don’t see appearing together very much. For example, both the 168/160 and 168/183 pairs of topics suggest that dragons and aliens don’t have a lot in common, as far as commonly-used vocabulary. Unnatural pair 235/160 suggests, to me, that Greek mythology and the Alien franchise don’t have a lot in common.
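Extracting the unnatural pairs from the similarity matrix is a small computation: scan the off-diagonal cells and keep the lowest ones. A sketch over a small invented matrix (the real one is 250×250):

```python
# A tiny symmetric similarity matrix with invented values; in the real
# analysis each row/column is one of the 250 topics.
sim = [
    [1.00, 0.60, 0.05, 0.40],
    [0.60, 1.00, 0.30, 0.02],
    [0.05, 0.30, 1.00, 0.50],
    [0.40, 0.02, 0.50, 1.00],
]

def most_dissimilar_pairs(sim, n=2):
    """The n topic pairs (i, j) with the lowest similarity, worst first.
    Only i < j is scanned, so each pair is counted once and the diagonal
    (always 1) is skipped."""
    pairs = [(sim[i][j], i, j)
             for i in range(len(sim)) for j in range(i + 1, len(sim))]
    pairs.sort()
    return [(i, j) for _, i, j in pairs[:n]]
```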
Maybe these results are obvious. Of course dragons and aliens don’t interact much in fiction. Duh. But instead of viewing this whole analysis as an elaborate Rube-Goldberg-like way of getting to an obvious result, I like to think of it as a challenge. My data says that dragons and aliens are almost maximally dissimilar, from the perspective of the vocabulary used to discuss them. I’m interested in coming up with something new, and, from this analysis, mixing dragons and aliens together will do that!
Therefore, please stay tuned while I pour myself another drink and start writing a story about dragons and aliens…
David, P. “Science Fiction vs. Science Funding: Comparing What We Imagine to What We Invent.” Small Wars Journal. September 12, 2017. Retrieved February 18, 2022. https://smallwarsjournal.com/jrnl/art/science-fiction-vs-science-funding-comparing-what-we-imagine-to-what-we-invent
Carlson, J., David, P., Hawes, T., and Nolan, J. “A Simple Method for Intersecting Affect and Topical Space.” Sentiment Symposium. New York, NY. May 2013.
Hawes, T. and David, P. “Assessing Attitudes in Unstructured Text.” 2nd International Conference on Cross-Cultural Decision Making: Focus 2012. 2012.