What do movie characters’ relationships reveal about gender, and how has this changed over time?

While there has been incremental progress, the gender gap in movies is still quite wide. Using subtitles and information from IMDb for over 15,000 movies, we found that there were just 3.3 women on average in the top 10 most central roles in movies across genres.

Data science has the potential to contribute to a wide range of social science questions. Here we turn our attention to the portrayal of women in movies, an industry with a significant influence on society that affects aspects of life including self-esteem and career choice.

In the past couple of years, the gender gap in the film industry has attracted a lot of attention. The issue is well-known: women are still underpaid and underrepresented. This situation has to change, and we think the best way to change something is by shedding light on where the problem is.

How can data science help?

As data scientists, we decided to utilize network and machine learning algorithms to investigate the gender gap problem in the film industry by performing the largest analysis to date.

To this end, we fused data from the online movie database IMDb and a dataset of movie dialogue taken from closed caption subtitles to create the largest available corpus of movie social networks (15,540 networks).

Analyzing this, we investigated the role of on-screen women in the film industry over the past century. First, we combined data on movie subtitles with the IMDb dataset. Next, using named entity recognition (NER), we extracted the movie characters from the subtitles and linked them to the actors. Then, we built a social network of interactions among the movie characters.

How to build a social network from subtitles


To better understand how our algorithm works, let’s look at three lines from the subtitles of “The Matrix” as an example. First, using NER, we detect where and when each character’s name appears in the subtitles. In this case, we have a scene where Morpheus talks with Neo. To verify that a detected name is a character and to find the actor who plays it, we match the names found in the subtitles with the character list from IMDb.

Lastly, using the matched characters, we create a link between characters whose names appear within a time interval shorter than a predefined threshold. In our example, we know that Morpheus introduces himself to Neo, and we know that Morpheus and Neo are talking within an interval of 5 seconds.

If 5 seconds is smaller than the predefined threshold, we add an edge between Neo and Morpheus. We repeat this process for all the lines in the movie’s subtitles, which results in a weighted social network where the edge weight is the number of times two nodes (characters) appeared together.
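The linking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: it assumes NER and IMDb matching have already produced a list of (time in seconds, character name) mentions, and the function name `build_network`, the sample timestamps, and the 10-second threshold are all hypothetical.

```python
from collections import Counter


def build_network(mentions, threshold_seconds):
    """Build weighted co-occurrence edges from (time_in_seconds, character) mentions."""
    ordered = sorted(mentions)  # sort by time so nearby mentions are adjacent
    edges = Counter()
    for i, (time_a, name_a) in enumerate(ordered):
        for time_b, name_b in ordered[i + 1:]:
            if time_b - time_a >= threshold_seconds:
                break  # later mentions are even further away in time
            if name_a != name_b:
                # Undirected edge: store the pair in a fixed (sorted) order.
                edges[tuple(sorted((name_a, name_b)))] += 1
    return edges


# Hypothetical mentions: Morpheus and Neo are named 5 seconds apart.
mentions = [(100.0, "Morpheus"), (105.0, "Neo"), (400.0, "Trinity")]
edges = build_network(mentions, threshold_seconds=10.0)
print(dict(edges))
```

With a threshold of 10 seconds, the only edge created is the one between Morpheus and Neo, with weight 1. Each further pair of nearby mentions of the same two characters would increase that weight by one.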


The evolution of female representation in the Star Wars movies series (Copyright Dima Kagan, Thomas Chesney & Michael Fire CC BY 4.0).

Looking at the center

Using these networks, we investigated the difference between genders in movies. We thought it would be interesting to analyze the number of women in the top-10 roles according to their centrality in the network.
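As a rough sketch of this kind of count, the snippet below ranks characters by weighted degree and counts the women among the top k. Weighted degree is an assumption chosen for simplicity; the text does not say which centrality measure was used. The function names, the toy edge weights, and the gender labels are hypothetical.

```python
from collections import defaultdict


def weighted_degree(edges):
    """Sum the weights of all edges touching each character."""
    degree = defaultdict(int)
    for (name_a, name_b), weight in edges.items():
        degree[name_a] += weight
        degree[name_b] += weight
    return dict(degree)


def women_in_top_roles(edges, gender, k=10):
    """Count characters labeled "F" among the k most central characters."""
    degree = weighted_degree(edges)
    # Rank by descending degree; break ties alphabetically for determinism.
    ranked = sorted(degree, key=lambda name: (-degree[name], name))
    return sum(1 for name in ranked[:k] if gender.get(name) == "F")


# Toy network and hypothetical gender labels (in practice taken from actor data).
toy_edges = {("A", "B"): 5, ("A", "C"): 3, ("B", "D"): 1}
toy_gender = {"A": "M", "B": "F", "C": "F", "D": "M"}
print(women_in_top_roles(toy_edges, toy_gender, k=2))
```

In the toy network the two most central characters are A and B, so the count for k=2 is 1.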

Despite improvements over the last century, on average there are still twice as many men as women in the leading roles in films.

The main roles are the most important in a film and get most of the spotlight, so it is important to see enough women in these roles. We found that, on average, women play fewer central roles in films, and the gap is very evident.

Over the last century, the number of women in these roles has been constantly growing; however, today on average there are still twice as many men as women in the top-10 roles in films. This result indicates that, on average, women have more minor roles.

How to measure fair representation?

Today, the most well-known measure of how fairly women are represented in films is the Bechdel test. To pass the test, a film has to meet three criteria: (1) it has to have at least two women in it, (2) who talk to each other, (3) about something besides a man.

The first thing that came to our mind was that it is not easy to check every movie manually to see if it passes the test, so why not automate it? We used the networks we constructed to extract network-based features, and we created an automated Bechdel test based on machine learning algorithms.

Using our automated Bechdel test, we found that some movies are currently misclassified in terms of their Bechdel test score. Additionally, we used it to score unclassified movies, discovering an increase in the number of movies passing the Bechdel test.

Marilyn Monroe was an American actress famous for being cast as comedic “blonde bombshell” characters. Image by skeeze, Pixabay CC-0.

While the Bechdel test is certainly a useful and important test, it fails to account for many parameters such as the centrality of the characters, repression, etc. Basically, if there is a movie with only two women who appear in one scene and talk about something other than men for a few seconds, then the movie will pass the traditional Bechdel test.

We strongly believe that today we should have a test that provides a more accurate measure of female representation in movies. We propose a test that measures the gender gap using the total degree (number of interactions) of each gender.
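To make the idea of total degree per gender concrete, here is a minimal sketch that sums the weighted degree of every character by gender and reports each gender's share of all interactions. It assumes the same edge format as a weighted character network, with hypothetical names and labels; it does not encode any pass or fail threshold.

```python
from collections import defaultdict


def degree_share_by_gender(edges, gender):
    """Return each gender's share of the total weighted degree in the network."""
    totals = defaultdict(int)
    for (name_a, name_b), weight in edges.items():
        # Every interaction counts once for each of its two participants.
        for name in (name_a, name_b):
            totals[gender.get(name, "unknown")] += weight
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {label: total / grand_total for label, total in totals.items()}


# Toy network and hypothetical gender labels.
sample_edges = {("A", "B"): 5, ("A", "C"): 3, ("B", "D"): 1}
sample_gender = {"A": "M", "B": "F", "C": "F", "D": "M"}
shares = degree_share_by_gender(sample_edges, sample_gender)
print(shares)
```

In this toy network the women (B and C) account for 9 of the 18 degree units, so each gender's share is 0.5.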

We believe a good rule of thumb should be:

Sadly, only 12% of all movies pass this test. That being said, we found much evidence of an improving trend in women’s representation in movies.

These results highlight that using large amounts of data alongside advanced algorithms has high potential to advance the study of gender inequality. Future studies using similar methods can also analyze TV series and additional types of media, uncovering additional gaps.
