Go Big or Go Home: Movies Revenue Analysis

Data imputation and regression analysis on 3400 Hollywood movies

6 min readJun 25, 2023

Introduction and Objectives

About 700 movies are released to the masses each year in America. Released movies must compete against each other to gain as much revenue as possible.

The objectives of this project are using regression analysis on the given dataset to find out:

The relationships between the predictors and the movie revenue.
Which predictors are the most significant in affecting the movie revenue.

Dataset Information

The dataset contains information from 3,400 Hollywood movies released in 2000–2020 which are separated into two tables. Here are the details of variable in each table:

Movie Daily Earnings Table

Movie title
Movie playing date
Daily earnings
Number of theaters playing the movie in given date
Rank of the movie based on the daily earnings

Movie Titles Table

Movie title
Domestic earnings (USD)
International earnings (USD)
Budget (USD)
Distributor name
MPAA rating
Runtime (minutes)
Genres

Preparing the Dataset

Here area some steps to prepare the dataset before building the regression model.

Checking the missing values.
Data imputation on the missing values.
Joining the tables into a single table.

You can check the details of the dataset preparation on this presentation deck.

The end result of these preparation process is a table consisting of these variables:

Total earnings (million USD) — Target variable
Percentage of international earning (%, compared to total earnings)
Budget (million USD)
Runtime (minutes)
Total number of days the movie is playing in theaters
Total number of days the movie is in top 5 rank in daily earnings.
Average number of theaters playing the movie each day.

EDA: Scatter Matrix

Scatter Matrix of the Variables

The joined data then visualized as a scatter matrix using Seaborn’s pairplot.

# Create scatter matrix
sns.pairplot(df_mv_joined)p

Here are some insights from this plot:

1 — Data Distribution

Some variables tend to be right skewed: International earn percentage, budget, runtime duration, days in theaters, and daily number of playing theaters

The other variables are tend to show an exponential distribution: total earnings and days as a top 5 movie.

2 — Variables Relationship

There are likely some strong relationship between our predictors and the target variable.

Next, we can move on to building our regression model and check whether our initial assessment is true.

Building the Regression Model

We build the model and fit it into the data.

# Create OLS model object
model = smf.ols("total_earn_mio ~ budget_mio \
                + int_earn_perc + runtime_min \
                + days_in_theater + days_top_5 \
                + daily_n_theaters", df_mv_joined)

# Fit the model
results = model.fit()

Analyzing the Model

We can show the summary of the fitted regression model as follow:

1 — Intercept and Predictors’ Coefficient

There are positive coefficient on each predictors. But the negative intercept value make the interpretation of the model is not quite straight forward.

That said, we’ll do some standardization by centering the predictors to tackle this issue.

2 — P-value of each Predictor

In the context of a regression analysis, the null hypothesis is a statement that there is no significant relationship between the independent variable(s) and the dependent variable. We can construct these hypotheses into H0 and H1 statement.

Null Hypothesis (H0): No significant relationship between the predictor and the target variable

Alternative Hypothesis (H1): There is significant relationship between the predictor and the target variable

Significance level (α): 0.05

The above result shows that there is no evidence to reject the null hypothesis which said that the runtime duration of a movie doesn’t significantly affect a movie’s total earning since the p-value is 0.559 which is greater than the α value of 0.05.

On the other hand, other predictors are able to reject the null hypothesis and can be said that they have a significant relationship to the total earnings of a movie.

Further Actions

Centering the predictors to standardize the results and make the model more straight forward to interpret.
Discarding runtime variable from the predictors when building the new model since it has no significant relationship to total_earn_mio (the target variable).

Standardizing by Centering the Predictors in the Model

By centering the predictors, we basically calculate how far is each observation in a variable from its variable mean.

We can achieve it in Python using this code:

# Find the mean of each predictor
budget_mean = df_mv_joined['budget_mio'].mean()
int_earn_perc_mean = df_mv_joined['int_earn_perc'].mean()
days_in_theater_mean = df_mv_joined['days_in_theater'].mean()
days_top_5_mean = df_mv_joined['days_top_5'].mean()
daily_n_theaters_mean = df_mv_joined['daily_n_theaters'].mean()

# Create a new variable consisting the centered values of each predictor
df_mv_joined['budget_mio_cent'] = df_mv_joined['budget_mio'] - budget_mean
df_mv_joined['int_earn_perc_cent'] = df_mv_joined['int_earn_perc'] - int_earn_perc_mean
df_mv_joined['days_in_theater_cent'] = df_mv_joined['days_in_theater'] - days_in_theater_mean
df_mv_joined['days_top_5_cent'] = df_mv_joined['days_top_5'] - days_top_5_mean
df_mv_joined['daily_n_theaters_cent'] = df_mv_joined['daily_n_theaters'] - daily_n_theaters_mean

Then we build a standardized model using these centered predictors.

# Create OLS model object
model = smf.ols("total_earn_mio ~ budget_mio_cent \
                + int_earn_perc_cent + days_in_theater_cent + days_top_5_cent \
                + daily_n_theaters_cent", df_mv_joined)

# Fit the model
results_standardized_model = model.fit()

(Note: the runtime variable is not included because its relationship to the total earnings variable is not statistically significant.)

Model Interpretations

Here is the final coefficient and standard error of the standardized model:

Based on this table, we could reveal the relationship between these predictors to a movie’s total earnings:

The intercept of 135 million USD represents the revenue of a movie that share the average values on all of its predictors.
The coefficient of budget_mio_cent indicates the expected difference of 2.5 million USD of revenue between two movies that differ by 1 million USD in budget but have the sane average values for all other predictors.
The coefficient of int_earn_perc_cent indicates the expected difference of 0.8 million USD of revenue between two movies that differ by 1 percent in their international earnings rate but have the same average values for all other predictors.
The coefficient of days_in_theater_cent indicates the expected difference of 0.5 million USD of revenue between two movies that differ by 1 day on their total number of days theaters playing the movies but have the same average values for all other predictors.
The coefficient of days_top_5_cent indicates the expected difference of 5.43 million USD of revenue between two movies that differ by 1 day of their total number of days as a top 5 movie (in daily earnings) but have the same average values for all other predictors.
The coefficient of daily_n_theaters_cent indicates the expected difference of 0.01 million or 10,000 USD of revenue between two movies that differ by 1 data-point of their daily number of theaters playing the movies but have the same average values for all other predictors.

Conclusions

Based on this regression model, the longer a movie stays in a top 5 rank (in daily earning) has the strongest effect on the movie total earnings. Budget of a movie comes second as the strongest predictor of its revenue.

On the other hand, the R-squared value of this model is 0.69, meaning that the predictors in this regression model can explain 69% of the variance in the dependent variable (total earnings), which considered relatively good.

Business and Further Research Recommendations

Business Recommendations

Go big or go home. Gather a big budget for your movie. Every million you spend will likely give you 2.5 millions in revenue, assuming all other variables of your movie are not below the mean.
Run a sprint, not a marathon. 1 day being the top 5 movies in daily earning will likely give a 10 times impact to the revenue more than prolonging the total number of days for the theaters to play your movie.

Further Research Recommendations

Due to time constraint, a more rigorous model evaluation is yet to be done. However, preliminary results suggest that the model is performing well in terms of generalizability.

It is also important to consider potential limitations and areas for improvement, such as incorporating categorical variables into the model considerations. Overall, further development and evaluation of this model is a promising step towards improving our understanding on how movie industry, Hollywood in general, works.