Go Big or Go Home: Movies Revenue Analysis
Data imputation and regression analysis on 3400 Hollywood movies
Introduction and Objectives
About 700 movies are released to the masses each year in America. Released movies must compete against each other to gain as much revenue as possible.
The objective of this project is to use regression analysis on the given dataset to find out:
- The relationships between the predictors and the movie revenue.
- Which predictors are the most significant in affecting the movie revenue.
Dataset Information
The dataset contains information on 3,400 Hollywood movies released between 2000 and 2020, split into two tables. Here are the variables in each table:
Movie Daily Earnings Table
- Movie title
- Movie playing date
- Daily earnings
- Number of theaters playing the movie on a given date
- Rank of the movie based on the daily earnings
Movie Titles Table
- Movie title
- Domestic earnings (USD)
- International earnings (USD)
- Budget (USD)
- Distributor name
- MPAA rating
- Runtime (minutes)
- Genres
Preparing the Dataset
Here are the steps to prepare the dataset before building the regression model.
- Checking the missing values.
- Data imputation on the missing values.
- Joining the tables into a single table.
You can check the details of the dataset preparation in this presentation deck.
The end result of this preparation process is a table consisting of these variables:
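The first two steps can be sketched as follows. This is a minimal illustration only: the tiny `df_titles` table, its column names, and the choice of median and most-frequent-value imputation are assumptions, since the article does not say which imputation method was used.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the movie titles table (column names are assumptions)
df_titles = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "budget": [100e6, np.nan, 20e6, 60e6],
    "runtime_min": [120, 95, np.nan, 105],
    "mpaa_rating": ["PG-13", "R", None, "PG-13"],
})

# Step 1: count the missing values in each column
missing_counts = df_titles.isna().sum()
print(missing_counts)

# Step 2: impute (assumed strategy: median for numeric columns,
# most frequent value for the categorical column)
df_imputed = df_titles.copy()
for col in ["budget", "runtime_min"]:
    df_imputed[col] = df_imputed[col].fillna(df_imputed[col].median())
df_imputed["mpaa_rating"] = df_imputed["mpaa_rating"].fillna(
    df_imputed["mpaa_rating"].mode().iloc[0]
)
```

The median is used here because budget-like variables are often skewed, which makes the mean sensitive to a few very large values.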
- Total earnings (million USD) — Target variable
- Percentage of international earnings (%, compared to total earnings)
- Budget (million USD)
- Runtime (minutes)
- Total number of days the movie is playing in theaters
- Total number of days the movie is in the top 5 by daily earnings
- Average number of theaters playing the movie each day
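As a rough sketch of how the per-movie variables above could be derived, the daily table can be aggregated by title and joined to the titles table. The two small tables and every raw column name (`rank`, `n_theaters`, `domestic_earn`, `int_earn`, `budget`) are hypothetical; only the resulting column names follow the ones used in the model later on.

```python
import pandas as pd

# Hypothetical daily earnings table
df_daily = pd.DataFrame({
    "title": ["A", "A", "A", "B", "B"],
    "date": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-01", "2020-01-02"]
    ),
    "n_theaters": [3000, 3000, 2800, 1200, 1000],
    "rank": [1, 2, 6, 7, 9],
})
# Hypothetical movie titles table (amounts in USD)
df_titles = pd.DataFrame({
    "title": ["A", "B"],
    "domestic_earn": [80e6, 10e6],
    "int_earn": [120e6, 10e6],
    "budget": [50e6, 8e6],
    "runtime_min": [120, 95],
})

# Aggregate the daily rows into one row per movie
df_daily["is_top_5"] = df_daily["rank"] <= 5
df_agg = df_daily.groupby("title").agg(
    days_in_theater=("date", "nunique"),
    days_top_5=("is_top_5", "sum"),
    daily_n_theaters=("n_theaters", "mean"),
).reset_index()

# Join the tables and derive the earnings variables
df_mv_joined = df_titles.merge(df_agg, on="title", how="inner")
total_earn = df_mv_joined["domestic_earn"] + df_mv_joined["int_earn"]
df_mv_joined["total_earn_mio"] = total_earn / 1e6
df_mv_joined["int_earn_perc"] = df_mv_joined["int_earn"] / total_earn * 100
df_mv_joined["budget_mio"] = df_mv_joined["budget"] / 1e6
print(df_mv_joined)
```

An inner join keeps only movies present in both tables; whether the original analysis did the same is not stated.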
EDA: Scatter Matrix
Scatter Matrix of the Variables
The joined data is then visualized as a scatter matrix using Seaborn’s pairplot.
import seaborn as sns
# Create scatter matrix
sns.pairplot(df_mv_joined)
Here are some insights from this plot:
1 — Data Distribution
Some variables tend to be right-skewed: international earnings percentage, budget, runtime, days in theaters, and daily number of theaters.
The other variables tend to show an exponential-like distribution: total earnings and days as a top 5 movie.
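A visual impression of skewness can be backed up with a number. The sketch below uses pandas' `skew()` on synthetic data (not the movie dataset); values well above zero indicate a right-skewed variable.

```python
import numpy as np
import pandas as pd

# Synthetic columns for illustration only
rng = np.random.default_rng(0)
df_demo = pd.DataFrame({
    "symmetric": rng.normal(100, 10, 1000),
    "right_skewed": rng.lognormal(3, 1, 1000),
})

# Sample skewness per column: about 0 for symmetric data, > 0 for a long right tail
skewness = df_demo.skew()
print(skewness)
```

The same call on `df_mv_joined` would give one skewness value per numeric variable.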
2 — Variables Relationship
There are likely some strong relationships between our predictors and the target variable.
Next, we can move on to building our regression model and check whether this initial assessment holds.
Building the Regression Model
We build the model and fit it to the data.
import statsmodels.formula.api as smf
# Create OLS model object
model = smf.ols("total_earn_mio ~ budget_mio \
+ int_earn_perc + runtime_min \
+ days_in_theater + days_top_5 \
+ daily_n_theaters", df_mv_joined)
# Fit the model
results = model.fit()
Analyzing the Model
We can show the summary of the fitted regression model as follows:
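For reference, this is roughly how such a summary is produced with statsmodels. The sketch fits a small model on synthetic data, so its numbers are not the article's results; on the real data the same calls would be made on `results`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: revenue depends on budget but not on runtime
rng = np.random.default_rng(42)
n = 200
df_demo = pd.DataFrame({
    "budget_mio": rng.uniform(1, 200, n),
    "runtime_min": rng.uniform(80, 180, n),
})
df_demo["total_earn_mio"] = 2.5 * df_demo["budget_mio"] + rng.normal(0, 50, n)

results_demo = smf.ols("total_earn_mio ~ budget_mio + runtime_min", df_demo).fit()

# Full text report: coefficients, standard errors, p-values, R-squared
print(results_demo.summary())

# The same quantities as a compact table
coef_table = pd.DataFrame({
    "coef": results_demo.params,
    "std_err": results_demo.bse,
    "p_value": results_demo.pvalues,
})
print(coef_table)
```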
1 — Intercept and Predictors’ Coefficient
Every predictor has a positive coefficient, but the negative intercept makes the model hard to interpret directly: it is the predicted revenue of a movie whose predictors are all zero, which is not a realistic movie.
To tackle this issue, we’ll standardize the model by centering the predictors.
2 — P-value of each Predictor
In the context of a regression analysis, the null hypothesis states that there is no significant relationship between an independent variable and the dependent variable. We can state the hypotheses as H0 and H1:
Null Hypothesis (H0): There is no significant relationship between the predictor and the target variable
Alternative Hypothesis (H1): There is a significant relationship between the predictor and the target variable
Significance level (α): 0.05
The result above shows that we fail to reject the null hypothesis for runtime: its p-value is 0.559, which is greater than the α value of 0.05, so there is no evidence that a movie’s runtime significantly affects its total earnings.
On the other hand, the null hypothesis is rejected for the other predictors, so they can be said to have a significant relationship with the total earnings of a movie.
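The decision rule used here (reject H0 when the p-value is below α) can be applied to all predictors at once. The sketch below runs on synthetic data with a pure-noise `runtime_min` column, so the p-values it prints are not the ones reported above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

ALPHA = 0.05  # significance level used in the article

# Synthetic data: budget drives revenue, runtime is unrelated noise
rng = np.random.default_rng(0)
n = 300
df_demo = pd.DataFrame({
    "budget_mio": rng.uniform(1, 200, n),
    "runtime_min": rng.uniform(80, 180, n),
})
df_demo["total_earn_mio"] = 2.5 * df_demo["budget_mio"] + rng.normal(0, 50, n)

results_demo = smf.ols("total_earn_mio ~ budget_mio + runtime_min", df_demo).fit()

# Compare each predictor's p-value with alpha (the intercept is not a predictor)
p_values = results_demo.pvalues.drop("Intercept")
decisions = p_values.apply(
    lambda p: "reject H0" if p < ALPHA else "fail to reject H0"
)
print(pd.DataFrame({"p_value": p_values, "decision": decisions}))
```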
Further Actions
- Centering the predictors to standardize the results and make the model more straightforward to interpret.
- Discarding the runtime variable from the predictors when building the new model, since it has no significant relationship to total_earn_mio (the target variable).
Standardizing by Centering the Predictors in the Model
By centering the predictors, we calculate how far each observation of a variable is from that variable’s mean.
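In symbols (the notation is introduced here for illustration and does not come from the original analysis), with $x_{ij}$ the value of predictor $j$ for movie $i$ and $n$ the number of movies:

```latex
x^{c}_{ij} = x_{ij} - \bar{x}_{j},
\qquad
\bar{x}_{j} = \frac{1}{n} \sum_{i=1}^{n} x_{ij}
```

A model fitted on the centered predictors then reads

```latex
\hat{y}_{i} = \beta_{0} + \sum_{j} \beta_{j} \, x^{c}_{ij}
```

so for a movie that sits exactly at the mean of every predictor, all $x^{c}_{ij}$ are zero and the prediction reduces to the intercept $\beta_{0}$.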
We can achieve it in Python using this code:
# Find the mean of each predictor
budget_mean = df_mv_joined['budget_mio'].mean()
int_earn_perc_mean = df_mv_joined['int_earn_perc'].mean()
days_in_theater_mean = df_mv_joined['days_in_theater'].mean()
days_top_5_mean = df_mv_joined['days_top_5'].mean()
daily_n_theaters_mean = df_mv_joined['daily_n_theaters'].mean()
# Create new columns holding the centered values of each predictor
df_mv_joined['budget_mio_cent'] = df_mv_joined['budget_mio'] - budget_mean
df_mv_joined['int_earn_perc_cent'] = df_mv_joined['int_earn_perc'] - int_earn_perc_mean
df_mv_joined['days_in_theater_cent'] = df_mv_joined['days_in_theater'] - days_in_theater_mean
df_mv_joined['days_top_5_cent'] = df_mv_joined['days_top_5'] - days_top_5_mean
df_mv_joined['daily_n_theaters_cent'] = df_mv_joined['daily_n_theaters'] - daily_n_theaters_mean
Then we build a standardized model using these centered predictors.
# Create OLS model object
model = smf.ols("total_earn_mio ~ budget_mio_cent \
+ int_earn_perc_cent + days_in_theater_cent + days_top_5_cent \
+ daily_n_theaters_cent", df_mv_joined)
# Fit the model
results_standardized_model = model.fit()
(Note: the runtime variable is not included because its relationship to the total earnings variable is not statistically significant.)
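It may help to see what centering does and does not change. The sketch below, on synthetic data with two of the predictor names reused for illustration, fits the same model with raw and with centered predictors: the slopes and R-squared are identical, and only the intercept moves, becoming the mean of the target.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only
rng = np.random.default_rng(1)
n = 300
df_demo = pd.DataFrame({
    "budget_mio": rng.uniform(1, 200, n),
    "days_top_5": rng.integers(0, 30, n),
})
df_demo["total_earn_mio"] = (
    -20 + 2.5 * df_demo["budget_mio"] + 5.0 * df_demo["days_top_5"]
    + rng.normal(0, 40, n)
)

# Center every predictor in a loop instead of one line per variable
for col in ["budget_mio", "days_top_5"]:
    df_demo[f"{col}_cent"] = df_demo[col] - df_demo[col].mean()

raw = smf.ols("total_earn_mio ~ budget_mio + days_top_5", df_demo).fit()
cent = smf.ols("total_earn_mio ~ budget_mio_cent + days_top_5_cent", df_demo).fit()

print("slopes (raw):     ", raw.params[["budget_mio", "days_top_5"]].values)
print("slopes (centered):", cent.params[["budget_mio_cent", "days_top_5_cent"]].values)
print("intercept (raw):     ", raw.params["Intercept"])
print("intercept (centered):", cent.params["Intercept"])
print("mean of target:      ", df_demo["total_earn_mio"].mean())
```

The loop is also a shorter alternative to writing one mean and one subtraction per predictor.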
Model Interpretations
Here are the final coefficients and standard errors of the standardized model:
Based on this table, we can describe the relationship between each predictor and a movie’s total earnings:
- The intercept of 135 million USD represents the expected revenue of a movie that has the average value on all of its predictors.
- The coefficient of budget_mio_cent indicates an expected difference of 2.5 million USD in revenue between two movies that differ by 1 million USD in budget but have the same values for all other predictors.
- The coefficient of int_earn_perc_cent indicates an expected difference of 0.8 million USD in revenue between two movies that differ by 1 percentage point in their international earnings share but have the same values for all other predictors.
- The coefficient of days_in_theater_cent indicates an expected difference of 0.5 million USD in revenue between two movies that differ by 1 day in their total number of days playing in theaters but have the same values for all other predictors.
- The coefficient of days_top_5_cent indicates an expected difference of 5.43 million USD in revenue between two movies that differ by 1 day in their total number of days as a top 5 movie (in daily earnings) but have the same values for all other predictors.
- The coefficient of daily_n_theaters_cent indicates an expected difference of 0.01 million USD (10,000 USD) in revenue between two movies that differ by 1 theater in their average daily number of theaters but have the same values for all other predictors.
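These interpretations amount to simple arithmetic on the reported coefficients. The helper below is a hypothetical illustration (the function name and its keyword interface are not from the original analysis); it takes each predictor's deviation from its mean and returns the expected revenue in million USD.

```python
# Coefficients and intercept as reported in the article (million USD)
INTERCEPT = 135.0
COEFS = {
    "budget_mio": 2.5,
    "int_earn_perc": 0.8,
    "days_in_theater": 0.5,
    "days_top_5": 5.43,
    "daily_n_theaters": 0.01,
}


def expected_revenue_mio(**deviations_from_mean):
    """Expected revenue for a movie described by how far each predictor is from its mean."""
    return INTERCEPT + sum(
        COEFS[name] * delta for name, delta in deviations_from_mean.items()
    )


# A movie that is average on every predictor
print(expected_revenue_mio())                 # 135.0
# Same movie with a budget 10 million USD above average
print(expected_revenue_mio(budget_mio=10))    # 160.0
# Two more top-5 days, but one week less in theaters overall
print(round(expected_revenue_mio(days_top_5=2, days_in_theater=-7), 2))  # 142.36
```

Predictors left out of the call are treated as being at their mean, which is exactly what the centered model's intercept describes.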
Conclusions
Based on this regression model, the number of days a movie stays in the top 5 (by daily earnings) has the strongest effect on its total earnings. The budget of a movie comes second as the strongest predictor of its revenue.
The R-squared value of this model is 0.69, meaning that the predictors in this regression model explain 69% of the variance in the dependent variable (total earnings), which is considered relatively good.
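For reference, R-squared is the standard ratio below, where $y_i$ is the observed total earnings of movie $i$, $\hat{y}_i$ the model's prediction, and $\bar{y}$ the mean total earnings (the notation is added here for illustration):

```latex
R^{2} = 1 - \frac{\sum_{i} \left( y_{i} - \hat{y}_{i} \right)^{2}}
                 {\sum_{i} \left( y_{i} - \bar{y} \right)^{2}}
```

A value of 0.69 therefore means the model's squared prediction errors add up to 31% of the total variation of earnings around their mean.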
Business and Further Research Recommendations
Business Recommendations
- Go big or go home. Gather a big budget for your movie. Every million USD you spend is associated with about 2.5 million USD in revenue, assuming all other variables of your movie are not below the mean.
- Run a sprint, not a marathon. One extra day among the top 5 movies in daily earnings is associated with roughly 10 times more revenue (5.43 versus 0.5 million USD) than one extra day of the movie playing in theaters.
Further Research Recommendations
Due to time constraints, a more rigorous model evaluation is yet to be done. However, preliminary results suggest that the model generalizes well.
It is also important to consider potential limitations and areas for improvement, such as incorporating categorical variables into the model. Overall, further development and evaluation of this model is a promising step towards improving our understanding of how the movie industry, and Hollywood in particular, works.
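One simple form such an evaluation could take is a hold-out test: fit on part of the data and measure the error on the rest. The sketch below is an assumption about how this might be done, not the author's procedure, and it uses synthetic data with an arbitrary 80/20 split.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data for illustration only
rng = np.random.default_rng(7)
n = 500
df_demo = pd.DataFrame({
    "budget_mio": rng.uniform(1, 200, n),
    "days_top_5": rng.integers(0, 30, n),
})
df_demo["total_earn_mio"] = (
    2.5 * df_demo["budget_mio"] + 5.43 * df_demo["days_top_5"]
    + rng.normal(0, 50, n)
)

# Shuffle, then split 80% train / 20% test
shuffled = df_demo.sample(frac=1, random_state=7)
n_train = int(0.8 * n)
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]

# Fit on the training rows only, then predict the held-out rows
fit = smf.ols("total_earn_mio ~ budget_mio + days_top_5", train).fit()
errors = test["total_earn_mio"] - fit.predict(test)

# Out-of-sample error metrics
rmse = float(np.sqrt((errors ** 2).mean()))
ss_tot = ((test["total_earn_mio"] - test["total_earn_mio"].mean()) ** 2).sum()
r2_test = float(1 - (errors ** 2).sum() / ss_tot)
print(f"in-sample R^2: {fit.rsquared:.3f}")
print(f"hold-out R^2:  {r2_test:.3f}, RMSE: {rmse:.1f} million USD")
```

A hold-out R-squared close to the in-sample value would support the generalizability claim; a much lower one would suggest overfitting.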
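As a starting point for the categorical variables, the statsmodels formula interface can dummy-encode a column with `C(...)`. The sketch below uses synthetic data in which `mpaa_rating` has no real effect; the column name is an assumption based on the MPAA rating field in the titles table.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: ratings are assigned at random and do not drive revenue
rng = np.random.default_rng(3)
n = 400
df_demo = pd.DataFrame({
    "budget_mio": rng.uniform(1, 200, n),
    "mpaa_rating": rng.choice(["G", "PG", "PG-13", "R"], size=n),
})
df_demo["total_earn_mio"] = 2.5 * df_demo["budget_mio"] + rng.normal(0, 50, n)

# C(...) expands the column into dummy variables, dropping one reference level
results_cat = smf.ols(
    "total_earn_mio ~ budget_mio + C(mpaa_rating)", df_demo
).fit()

rating_terms = [
    name for name in results_cat.params.index
    if name.startswith("C(mpaa_rating)")
]
print(rating_terms)
print(results_cat.params)
```

Each rating coefficient is read as the expected revenue difference relative to the reference rating, holding budget fixed.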