Analyzing Customer Transactions in Olist, a Brazilian E-Commerce Platform

Data wrangling with pandas and querying with pandasql

Bagus Guntur Farisa
Apr 16, 2023

Objective

Olist is a Brazilian e-commerce platform. Users on this platform can buy and sell various products, ranging from computer accessories to beauty products.

In this project, I conducted a descriptive analysis to explore these two topics:

  1. How well is the company growing in general? (Monthly GMV)
  2. How do customers’ transaction behaviors differ by location and time?

Data and Process Overview

The dataset consists of several data tables, whose contents can be summarized as follows:

  1. Orders (order time, buyer, seller, reviews)
  2. Customers (customer location)
  3. Sellers (seller location)
  4. Payments (payment method, payment value)
  5. Products (product category)

These tables are stored in a single .db file. To access them, I connected to the file and queried the tables using SQLite.

After that, I converted each table into a pandas DataFrame to start the data wrangling and analysis process.
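As a rough sketch of this step (the file name olist.db and the table names here are assumptions and should be adjusted to the actual database), the connection and conversion could look like this:

import sqlite3
import pandas as pd

# Connect to the SQLite database file (file name assumed)
conn = sqlite3.connect("olist.db")

# Load each table into its own pandas DataFrame (table names assumed)
df_order = pd.read_sql("SELECT * FROM orders", conn)
df_customers = pd.read_sql("SELECT * FROM customers", conn)
df_sellers = pd.read_sql("SELECT * FROM sellers", conn)
df_order_payments = pd.read_sql("SELECT * FROM order_payments", conn)
df_order_reviews = pd.read_sql("SELECT * FROM order_reviews", conn)

conn.close()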

Identifying and Handling Problems in the Data

1. Missing data

Identifying how much data is missing

I found the percentage of missing data using this line of code for each data table:

# Find the percentage of data that is missing (NaN)
df.isna().sum()/len(df)*100

In this table (df_order), we can see several columns that are each missing a few percent of their data.

On the other hand, this table (df_order_reviews) is missing most of the data in two of its columns.

Based on the severity of the missing data, each case is handled accordingly.

Handling the identified missing data

Missing date data is filled using backward filling (next observation carried backward). This method is selected because date-time data tends to be sequential.

And I think it’s safe to assume that the date value of a NaN observation would not exceed that of the next non-NaN observation.

# Fill the missing data using backward filling
df_order["order_approved_at"] = df_order["order_approved_at"].bfill()
df_order["order_delivered_carrier_date"] = df_order["order_delivered_carrier_date"].bfill()
df_order["order_delivered_customer_date"] = df_order["order_delivered_customer_date"].bfill()
# Check the current condition
df_order.isna().sum()/len(df_order)*100

The end result looks like this (no more missing data):

On the other hand, for the table whose columns are missing most of their data, those columns are removed entirely.

There are several considerations behind this:

  1. The columns are missing too much of their data.
  2. I’m not going to use the data in the first place (it’s non-categorical string data that is better suited for sentiment analysis, which is out of the scope of this project).

By running this code, we can remove the unwanted columns from this particular table (df_order_reviews):

# Delete/drop the columns
df_order_reviews = df_order_reviews.drop(['review_comment_title', 'review_comment_message'], axis=1)
# Check the current table condition
df_order_reviews.isna().sum()/len(df_order_reviews)*100

The output (the ‘review_comment_title’ and ‘review_comment_message’ columns are deleted):

2. Outlier

Identifying the outliers

As a starting point, we use df.describe() to find out whether a table is likely to contain outliers or not.

Here is an example: payment_value, a variable in df_order_payments whose max value is far away from its median.

We can run this line of code to visually check if the variable contains outliers:

import seaborn as sns

# Plot the variable
sns.histplot(df_order_payments, x="payment_value", bins=100)

Handling the identified outliers

To tackle this problem, we first need to find the upper bound of the variable and then replace every value higher than the upper bound with the variable’s median.

Here’s how we achieve that with Python:

# Define upper bound
upper_bound = df_order_payments["payment_value"].median() + df_order_payments["payment_value"].std()*3 # median is used because data is skewed
# Find the median of payment_value
med_payment_value = df_order_payments['payment_value'].median()
# Replace the outlier with the median
df_order_payments.loc[df_order_payments['payment_value'] > upper_bound, "payment_value"] = med_payment_value

And by doing that, we get a much smaller max value, one that is closer to the median.

By plotting the altered payment_value as a histogram, we can see the change in the distribution:
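The plotting call from before can simply be reused on the altered column:

# Re-plot the distribution after replacing the outliers
sns.histplot(df_order_payments, x="payment_value", bins=100)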

Writer’s note: I find this article very useful regarding outlier-handling.

3. Data type inconsistency

Identifying data type inconsistencies

We can identify the type of each variable using df.dtypes. Using this method, I found several data type mismatches: the date-time columns still have the object (string) data type.

Changing the data type

We can change the data type of a particular variable into a datetime data type using this code:

# Change the data type from string to datetime
df_order['order_purchase_timestamp'] = pd.to_datetime(df_order['order_purchase_timestamp'], format='%Y-%m-%d %H:%M:%S')

I also added the format='%Y-%m-%d %H:%M:%S' parameter to match the datetime format of the timestamp values.

After applying this method, we can see that the data type of the timestamps is now datetime64[ns].

4. Format inconsistency

Identifying format inconsistencies

Format inconsistencies tend to appear in categorical data. To find them, we can use df[column_name].value_counts() and df[column_name].unique(). As we can see here, there are some format inconsistencies in the city names; one of the most notable is 'sao paulo'.
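For example, on the seller table (the column is named seller_city, as used later in this section) the check could look like this:

# Count how often each spelling of a city name appears
df_sellers['seller_city'].value_counts()

# List every unique spelling found in the column
df_sellers['seller_city'].unique()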

Standardizing the format

To standardize the format, I apply these 2 steps:

  1. Character standardization
  2. Name standardization

First, I created a dictionary to map the non-standard characters to their ‘a-to-z’ counterparts. I went through this step first to make the second step easier.

dict_chara_replace = {
    "ã" : "a",
    "á" : "a",
    "â" : "a",
    "ç" : "c",
    "é" : "e",
    "ê" : "e",
    "í" : "i",
    "ó" : "o",
    "ú" : "u",
    "'" : " ",
    "£" : "",
    "ô" : "o",
    "õ" : "o",
    "-" : " ",
    " " : " "
}

Now I create a function to replace the characters based on the character map I just created.

# Create a function to standardize the characters
def standardize_chara(data, col_name):
    # Lowercase the values first
    data[col_name] = data[col_name].str.lower()

    # Replace each non-standard character based on the mapping
    for chara in dict_chara_replace:
        data[col_name] = data[col_name].str.replace(chara, dict_chara_replace[chara], regex=False)

# Standardize the characters in the seller city column
standardize_chara(df_sellers, 'seller_city')

After that, I start the second step by creating a dictionary that maps non-standard city names to their standard counterparts.
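The full mapping is not reproduced here; a minimal illustrative sketch (the entries below are hypothetical examples, and the real ones depend on what value_counts() reveals) could look like this:

# Map non-standard city names to their standard counterpart (illustrative entries)
map_city_names = {
    "sao paulo sp": "sao paulo",
    "sao paulo sao paulo": "sao paulo",
    "sp": "sao paulo"
}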

Then, I replace the city names based on that map:

# Replace the inconsistencies of city names
df_sellers['seller_city'] = df_sellers['seller_city'].replace(map_city_names)

As we can see in the output below, the 'sao paulo' city name now has only one format, the standard one.

Writer’s note: there is a more automated way to approach this issue, using the fuzzywuzzy library. However, it’s not performant enough and is too time-consuming (it needs approx. 200 hours to process 1 million rows of data).

Analysis: Key Highlight(s)

Writer’s note: I use pandasql to query and analyze the wrangled dataset. One of the reasons I’m using this method is to keep all of my workflows in one workspace (in this case, a Jupyter notebook running in VS Code).
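For readers unfamiliar with pandasql: it runs SQL (SQLite syntax) directly against the DataFrames living in the notebook’s namespace. A minimal sketch of the pattern used for the queries below:

from pandasql import sqldf

# Helper so every query can see the DataFrames defined in the notebook
pysqldf = lambda q: sqldf(q, globals())

# Quick sanity check: count the rows of the orders table
pysqldf("SELECT COUNT(*) AS n_orders FROM df_order")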

#1 Decline in Gross Merchandise Value

Gross Merchandise Value (GMV) is the total value of merchandise sold over a given period of time through a customer-to-customer (C2C) exchange site. GMV is usually used as a measure of the growth of an e-commerce platform.

Based on the dataset, the GMV has been declining over the last 12 months, and it plummeted particularly in September 2018.

GMV for the past 12 months
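The monthly GMV behind this chart can be reproduced with a query along these lines (a sketch using the pysqldf helper above, assuming GMV is the sum of payment_value grouped by purchase month):

# Monthly GMV: total payment value grouped by purchase month
query_gmv = """
SELECT
    strftime('%Y-%m', o.order_purchase_timestamp) AS order_month,
    SUM(p.payment_value) AS gmv
FROM df_order AS o
JOIN df_order_payments AS p ON o.order_id = p.order_id
GROUP BY order_month
ORDER BY order_month
"""
monthly_gmv = pysqldf(query_gmv)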

#2 Sao Paulo as the leader in transaction value

When analyzed by location, Sao Paulo comes first as the leader in transaction value on the platform.
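A sketch of the per-location breakdown, joining orders to customers and payments (column names follow the public Olist schema):

# Total transaction value per customer city, top 10
query_city = """
SELECT
    c.customer_city,
    SUM(p.payment_value) AS total_payment_value
FROM df_order AS o
JOIN df_customers AS c ON o.customer_id = c.customer_id
JOIN df_order_payments AS p ON o.order_id = p.order_id
GROUP BY c.customer_city
ORDER BY total_payment_value DESC
LIMIT 10
"""
pysqldf(query_city)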

#3 People spend the most on household, health, and hobby products

Household, health and beauty, and hobby-related products are the most common categories in the top 10 by transaction value.

#4 Favorite day to do shopping

The average payment value per transaction throughout the week is relatively flat.

Payment Value per Transaction by Day of the Week

The chart indicates that a single buyer tends to spend a relatively constant amount of money on each purchase throughout the week.

However, even though the value per transaction stays flat throughout the week, the number of orders is highest on Monday, indicating that the highest merchandise value is produced on that day.
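Both observations can be checked with a query like this (a sketch; SQLite’s strftime('%w', ...) returns 0 for Sunday through 6 for Saturday):

# Average payment value per transaction and order count by day of week
query_dow = """
SELECT
    strftime('%w', o.order_purchase_timestamp) AS day_of_week,
    AVG(p.payment_value) AS avg_payment_value,
    COUNT(DISTINCT o.order_id) AS n_orders
FROM df_order AS o
JOIN df_order_payments AS p ON o.order_id = p.order_id
GROUP BY day_of_week
"""
pysqldf(query_dow)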

#5 Favorite payment method

Credit card is the most prominent payment method used by buyers, with boleto (similar to a virtual account in Indonesia) coming in second.
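This ranking can be checked directly on the payments table (assuming the payment method column is named payment_type, as in the public Olist schema):

# Number of transactions per payment method
df_order_payments['payment_type'].value_counts()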

#6 Order delivery punctuality and its impact on the rating

Around 10% of the orders aren’t delivered on time (i.e., later than the estimated delivery date). Even when broken down by weekday, the on-time rate is relatively similar.

Delivery punctuality is likely to impact the overall order experience: the average rating of late orders is much lower than that of punctual ones.
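A sketch of both calculations, assuming the review table joins to orders on order_id and that a late order is one delivered after order_estimated_delivery_date:

# Share of late deliveries and average review score by punctuality
query_punctuality = """
SELECT
    CASE
        WHEN o.order_delivered_customer_date > o.order_estimated_delivery_date THEN 'late'
        ELSE 'on time'
    END AS punctuality,
    COUNT(*) AS n_orders,
    AVG(r.review_score) AS avg_review_score
FROM df_order AS o
JOIN df_order_reviews AS r ON o.order_id = r.order_id
GROUP BY punctuality
"""
pysqldf(query_punctuality)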
