RecSys Challenge 2024: Exploratory Data Analysis¶
Introduction¶
Purpose
This article will cover the exploratory data analysis of the RecSys 2024 Challenge dataset. The content will be structured into the following sections:
- Data Preprocessing
- Functions
- Plot Functions
- Feature Functions
- Feature Analysis
- Overall Feature Analysis
- Article
- User
- Session
- Topic
- Devices
- Age
- Postcodes
For more in-depth analysis, please check out the notebook!
About
This year's challenge focuses on online news recommendation, addressing both the technical and normative challenges inherent in designing effective and responsible recommender systems for news publishing. The challenge will delve into the unique aspects of news recommendation, including modeling user preferences based on implicit behavior, accounting for the influence of the news agenda on user interests, and managing the rapid decay of news items. Furthermore, our challenge embraces the normative complexities, involving investigating the effects of recommender systems on the news flow and whether they resonate with editorial values. [1]
Challenge Task
The Ekstra Bladet RecSys Challenge aims to predict which article a user will click on from a list of articles that were seen during a specific impression. Utilizing the user's click history, session details (like time and device used), and personal metadata (including gender and age), along with a list of candidate news articles listed in an impression log, the challenge's objective is to rank the candidate articles based on the user's personal preferences. This involves developing models that encapsulate both the users and the articles through their content and the users' interests. The models are to estimate the likelihood of a user clicking on each article by evaluating the compatibility between the article's content and the user's preferences. The articles are ranked based on these likelihood scores, and the precision of these rankings is measured against the actual selections made by users. [1]
Dataset Information
The Ekstra Bladet News Recommendation Dataset (EB-NeRD) was created to support advancements in news recommendation research. It was collected from user behavior logs at Ekstra Bladet. We collected behavior logs from active users during the 6 weeks from April 27 to June 8, 2023. This timeframe was selected to avoid major events, e.g., holidays or elections, that could trigger atypical behavior at Ekstra Bladet. The active users were defined as users who had at least 5 and at most 1,000 news click records in a three-week period from May 18 to June 8, 2023. To protect user privacy, every user was delinked from the production system when securely hashed into an anonymized ID using one-time salt mapping. Alongside, we provide Danish news articles published by Ekstra Bladet. Each article is enriched with textual context features such as title, abstract, body, categories, among others. Furthermore, we provide features that have been generated by proprietary models, including topics, named entity recognition (NER), and article embeddings. [2]
For more information on the dataset, see the EB-NeRD dataset page.
Key Metrics
We need to establish specific metrics and analyze how different features impact them. Our platform generates revenue through both subscriptions and advertisements. User engagement is crucial: the more time users spend reading news articles, the greater our advertisement revenue. We will need this insight for the next section on model selection for a recommendation system.
Data Preprocessing¶
Let's start by importing the packages required for this section.
# Packages
from datetime import datetime
from plotly.subplots import make_subplots
import numpy as np
import pandas as pd
import plotly
import plotly.express as px
import plotly.graph_objects as go
Data Sources
Articles: Detailed information about news articles.
Behaviors: Impression logs.
History: Users' click histories.
# Load in various dataframes
# Articles
df_art = pd.read_parquet("Data/Small/articles.parquet")
# Behaviors
df_bev = pd.read_parquet("Data/Small/train/behaviors.parquet")
# History
df_his = pd.read_parquet("Data/Small/train/history.parquet")
How can we merge these data sources?
Info
We've noticed that there are columns shared among each of these data sources that we can use for joining.
- Articles <> Article ID <> Behavior
- History <> User ID <> Behavior
Before merging them together, we'll need to adjust the datatype of the behaviors['article_id'] feature.
# Convert the datatype of the article_id column first
df_bev['article_id'] = df_bev['article_id'].apply(lambda x: x if type(x) == str else x.astype(np.int32))
# Join behaviors to articles
df = df_bev.join(df_art.set_index("article_id"), on="article_id")
# Join behaviors to history
df = df.join(df_his.set_index("user_id"), on="user_id")
# Free the individual dataframes from memory
df_bev = []
df_his = []
df_art = []
Next, we'll convert several columns into more readable types: device_type from integer to string, gender from float to string, postcode from float to string, article_id from string to integer, and age from an integer to a string range. The helper functions below handle the categorical mappings.
def device_(x):
"""
Changes the device input from an int to a str
Keyword arguments:
x -- int
Output:
str
"""
if x == 1:
return 'Desktop'
elif x == 2:
return 'Mobile'
else:
return 'Tablet'
def gender_(x):
"""
Changes the gender input from a float to a str
Keyword arguments:
x -- float
Output:
str
"""
if x == 0.0:
return 'Male'
elif x == 1.0:
return 'Female'
else:
return None
def postcodes_(x):
"""
Changes the postcodes input from a float to a str
Keyword arguments:
x -- float
Output:
str
"""
if x == 0.0:
return 'Metropolitan'
elif x == 1.0:
return 'Rural District'
elif x == 2.0:
return 'Municipality'
elif x == 3.0:
return 'Provincial'
elif x == 4.0:
return 'Big City'
else:
return None
# Preprocessing
df.dropna(subset=['article_id'], inplace=True)
# Change article IDs into int
df['article_id'] = df['article_id'].apply(lambda x: int(x))
df['article_id'] = df['article_id'].astype(np.int64)
# Change device type from int to string
df['device_type'] = df['device_type'].apply(lambda x: device_(x))
# Change genders from float to string
df['gender'] = df['gender'].apply(lambda x: gender_(x))
# Change age from int to a string range (e.g., 50 -> '50 - 59')
df['age'] = df['age'].astype('Int64')
df['age'] = df['age'].astype(str)
df['age'] = df['age'].apply(
lambda x: x if x == '<NA>' else x + ' - ' + x[0] + '9')
# Change postcode from float to str
df['postcode'] = df['postcode'].apply(lambda x: postcodes_(x))
Functions¶
This section is divided into two types of functions used for EDA:
- Visualization
- Feature preprocessing
Plot Functions¶
We'll implement functions to generate the following visualizations:
- Single and Multiple Categorical Bar Plots
- Single and Multiple Categorical Histograms, Box Plots, and Bar plots
- Scatter plots to measure activity across a time period
Below is an example of one of the plot functions. It generates a histogram, a box plot, and a bar plot for two features, and is useful when comparing a categorical feature (such as age) against a numerical feature (such as read time).
def multiple_subset_feature_visualization(
df_,
feature_1, feature_2,
feature_2_title, feature_1_title,
histogram_xaxis_title
) -> "Graph":
"""
Displays multiple plots: Histogram, Box, and Bar plots based on multiple features given.
Keyword arguments:
df_ -- pd.DataFrame
feature_1 -- str
feature_2 -- str
feature_1_title -- str
feature_2_title -- str
histogram_xaxis_title -- str
Output:
Plotly graph object!
"""
# Make subplots object
fig = make_subplots(
rows=3, cols=1, subplot_titles=("<b>Histogram<b>", "<b>Box plot<b>", "<b>Average {} for {}<b>".format(feature_2_title, feature_1_title))
)
# Assign tmp_df based on feature
if feature_1 == 'age':
tmp_df = df_[df_['age'] != '<NA>']
else:
tmp_df = df_[~df_[feature_1].isnull()]
# Create a category list from the feature given
categories = [d for d in tmp_df[feature_1].unique()]
categories.sort()
# Iterate through each category and produce a histogram, boxplot, and bar plots for that subset of the data
for category_ in categories:
subset_feature_2 = tmp_df[tmp_df[feature_1]== category_][feature_2].values
avg = round(float(tmp_df[tmp_df[feature_1] == category_][feature_2].mean()), 3)
# Add histogram
fig.add_trace(
go.Histogram(
x=subset_feature_2,
name=str(category_) + ' Histogram',
),
row=1, col=1
)
# Add Boxplot
# Need to create an array that is similar to the array used in subset_feature_2, to name the traces!
xo = [str(category_) for x in range(0, len(subset_feature_2))]
fig.add_trace(
go.Box(
y=subset_feature_2, x=xo,
name=str(category_) + ' Box',
),
row=2, col=1
)
# Add Bar
fig.add_trace(
go.Bar(
x=[str(category_)], y=[avg],
text='<b>{}<b>'.format(avg),
textposition='outside',
name=str(category_) + ' Bar',
textfont=dict(
family='sans serif',
size=18,
color='#1f77b4'
)
),
row=3, col=1
)
# Update xaxis properties
fig.update_xaxes(
title_text='<b>{}<b>'.format(str(histogram_xaxis_title)), row=1, col=1
)
fig.update_xaxes(
title_text='<b>{}<b>'.format(str(feature_1_title)), row=2, col=1
)
fig.update_xaxes(
title_text='<b>{}<b>'.format(str(feature_1_title)), row=3, col=1
)
# Update yaxis properties
fig.update_yaxes(
title_text='<b>Count<b>', row=1, col=1, type = 'log'
)
fig.update_yaxes(
title_text='<b>{}<b>'.format(str(feature_2_title)), row=2, col=1, type ='log'
)
fig.update_yaxes(
title_text='<b>{}<b>'.format(str(feature_2_title)),
range=[0, 125], row=3, col=1
)
# Update subplot title sizes
fig.update_annotations(
font_size=20,
)
# Update title and height
fig.update_layout(
title_text="<b>Distributions of {} for {}<b>".format(
feature_2_title, feature_1_title),
height=750, width=1000,
font=dict(
family="Courier New, monospace",
size=16,
)
)
return fig
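The remaining plot helpers used throughout this article (plot_bar, plot_box, plot_scatter, and the activity/topic-distribution variants) are defined in the notebook and follow the same conventions. As a reference, here is a minimal sketch of plot_bar, with the signature inferred from how it is called later in this article; the log-scaled y-axis (ranges given in log10 units) is an assumption based on the yrange_ values used in those calls, and the notebook's actual implementation may differ.
def plot_bar(indices_, values_, yrange_, xaxis_title, yaxis_title, title_) -> "Graph":
    """
    Displays a single bar plot with a log-scaled y-axis.
    Keyword arguments:
    indices_ -- list of category labels (x-axis)
    values_ -- list of counts or averages (y-axis)
    yrange_ -- list, y-axis range in log10 units (e.g., [0, 3] spans 1 to 1,000)
    xaxis_title -- str
    yaxis_title -- str
    title_ -- str
    Output:
    Plotly graph object!
    """
    fig = go.Figure(go.Bar(x=[str(i) for i in indices_], y=values_))
    fig.update_xaxes(title_text='<b>{}<b>'.format(xaxis_title))
    fig.update_yaxes(title_text='<b>{}<b>'.format(yaxis_title), type='log', range=yrange_)
    fig.update_layout(
        title_text=title_,
        font=dict(family="Courier New, monospace", size=16),
        height=500, width=1000
    )
    return fig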
Feature Functions¶
These are helper functions designed to preprocess features, preparing them for use in the visualization functions above.
They are split into separate sections for specific features:
- Article
- User
- Topic
The following code snippet demonstrates a function that counts the occurrences of items in a list by populating a dictionary.
def populate_dict(list_, dict_):
"""
Counts how many times each item appears in the list, storing the counts in the dict.
Keyword arguments:
list_ -- list
dict_ -- dict
Output:
None (dict_ is updated in place)
"""
# Iterate through each item in the list and increment its count in the dict
for idx in list_:
if idx not in dict_:
dict_[idx] = 1
else:
dict_[idx] += 1
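For example, calling it on a small list (the topic labels here are made up purely for illustration):
# Quick illustration of populate_dict
topic_counts = {}
populate_dict(list_=['sport', 'krimi', 'sport'], dict_=topic_counts)
print(topic_counts)  # {'sport': 2, 'krimi': 1}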
Feature Analysis¶
We aim to gain insights into which features we can utilize for our recommendation system. This analysis will be brief. For more information, check out the notebook which contains more in-depth feature analysis.
Main Questions:¶
These are questions we seek to address by the end of this analysis.
What features provide details about an article?
- Topic
- Read Time
- Scroll Percentage
What is the behavior of the following features?
- Article
- User
- Session
- Devices
- Ages
- Postcodes
How can we explain the activity of our users across a time period? We'll segment it by categorical features such as age, device, gender, and postcode, at the following granularities:
- Daily
- Hourly
- Weekly
- Day of the week
What does the topic distribution look like across categorical features such as age, device, and postcode?
Overall Feature Analysis¶
How many impressions are there in total?
Solution
There are 70421 impressions in this data.
What does the distribution of read times look like?
Solution
Long-tailed distribution.
How is the distribution of scroll percentages represented?
Solution
Long-tailed distribution.
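These headline numbers, along with the article, user, session, and topic totals quoted in the following sections, can be reproduced with simple aggregations. A quick sketch on the merged df, assuming the behaviors file exposes an impression_id column as in the EB-NeRD schema:
# Headline counts from the merged dataframe
print('Impressions:', df['impression_id'].nunique())
print('Articles:', df['article_id'].nunique())
print('Users:', df['user_id'].nunique())
print('Sessions:', df['session_id'].nunique())
# Distribution summaries for read time and scroll percentage
print(df[['read_time', 'scroll_percentage']].describe())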
Article¶
What is the total number of articles?
Solution
There are 1723 articles.
How many unique articles are clicked in a single session?
Solution
The distribution ranges from 1 to 19 articles clicked in a single session.
# How many unique articles are clicked in a session?
# Group by sessions and get the article ids
tmp_aps = df.groupby('session_id')['article_id'].apply(list)
# Create a dict to store the count of articles per session
articles_per_session = {k: 0 for k in range(1, 20)}
# Iterate through the grouped lists and record the number of articles per session in our result dict
for i in tmp_aps:
num_articles = len(i)
articles_per_session[num_articles] += 1
# Set as our indices / values for plot
indices = [k for k in articles_per_session.keys()]
values = [k for k in articles_per_session.values()]
# Plot
plot_bar(
indices_=indices, values_=values,
yrange_=[0, 5], xaxis_title='Number of Articles ',
yaxis_title='Count', title_='<b> Number of Articles clicked in a session<b>')
What is the average read time and scroll percentage for each article?
Solution
# Get the average readtime and scroll percentages for all articles!
# Unique User Ids
unique_user_ids = df['user_id'].values[0:1000]
# Take the set of user IDs, since the same user appears in multiple impression rows
unique_user_ids = set(unique_user_ids)
# Unique Article Ids
unique_article_ids = df['article_id'].unique()
unique_article_ids = unique_article_ids[~np.isnan(unique_article_ids)]
# Create dictionaries
unique_article_read = {k: [0] for k in unique_article_ids}
unique_article_read_avg = {k: [0] for k in unique_article_ids}
unique_article_scroll = {k: [0] for k in unique_article_ids}
unique_article_scroll_avg = {k: [0] for k in unique_article_ids}
# Iterate across each user id
for id in unique_user_ids:
# Get the subset of that user id
tmp_df = df[df['user_id'] == id]
# Now lets go through each scroll and article
indices = np.array(tmp_df.index)
for i in indices:
# Select the read time, article, and scroll values at that index
tmp_read = tmp_df['read_time_fixed'][i]
tmp_article = tmp_df['article_id_fixed'][i]
tmp_scroll = tmp_df['scroll_percentage_fixed'][i]
# Create list objects for article, read, scroll
read = [x for x in tmp_read]
scroll = [x for x in tmp_scroll]
articles = [np.int64(x) for x in tmp_article]
# Populate our unique_article_read dictionary based on the results found in our previous list objects
tmp_articles_read = {k: v for k, v in zip(articles, read)}
article_id_read_scroll(tmp_articles_read, unique_article_read)
# Populate our unique_article_scroll dictionary based on the results found in our previous list objects
tmp_articles_scroll = {k: v for k, v in zip(articles, scroll)}
article_id_read_scroll(tmp_articles_scroll, unique_article_scroll)
# Get the average scroll percentage and read times for each article
for k, v in zip(unique_article_read.keys(), unique_article_read.values()):
unique_article_read_avg[k] = np.mean(v)
for k, v in zip(unique_article_scroll.keys(), unique_article_scroll.values()):
unique_article_scroll_avg[k] = np.mean(v)
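The article_id_read_scroll helper used above is defined in the notebook; here is a minimal sketch, assuming it simply appends each value to the accumulator list for its article ID while skipping IDs that aren't tracked and missing values.
def article_id_read_scroll(tmp_dict_, acc_dict_):
    """
    Appends each article's value (read time or scroll percentage) to the
    accumulator lists keyed by article ID.
    Keyword arguments:
    tmp_dict_ -- dict mapping article_id -> value for one history row
    acc_dict_ -- dict mapping article_id -> list of collected values
    Output:
    None (acc_dict_ is updated in place)
    """
    for k, v in tmp_dict_.items():
        # Only accumulate articles we are tracking, and skip NaN values
        if k in acc_dict_ and not np.isnan(v):
            acc_dict_[k].append(v)
The topics_article_id_scroll_read helper used later in the Topic section follows the same accumulate-into-lists pattern, keyed by topic instead of article ID.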
Average Read Time¶
The average reading time spans from 0 to 43.29 seconds, with outliers extending beyond this range.
# Distribution of Read Times for each Article
## Indices / Values
indices = ['<b>All Unique Articles<b>']
values = [x for x in unique_article_read_avg.values()]
## Plot
plot_box(
indices_=indices, values_=[values],
yrange_=[0, 3], xaxis_title='',
yaxis_title='Read Time(s)', title_='<b> Distributions of Read Times Across All Articles<b>')
Average Scroll Percentage¶
The average scroll percentage ranges from 0 - 100%.
# Distribution of Scroll Percentages for each Article
## Indices / Values
indices = ['<b>All Unique Articles<b>']
values = [x for x in unique_article_scroll_avg.values()]
## Plot
plot_box(
indices_=indices, values_=[values],
yrange_=[0, 2], xaxis_title='',
yaxis_title='Scroll Percentage (%)', title_='<b> Distributions of Scroll Percentage Across All Articles<b>')
User¶
What is the total number of users?
Solution
There are 9194 users.
What does daily user growth look like?
Solution
There are significant peaks observed in the first three dates, followed by a tapering off in the subsequent dates.
# Record the daily user growth
unique_user_ids = df['user_id'].unique()
# Create dictionaries
unique_users_daily_growth_freq= {}
unique_users_hourly_freq = {}
unique_users_dayofweek_freq = {}
unique_users_weekly_freq = {}
# Iterate through each user id and record the date of their first impression (join date)
for id in unique_user_ids[0:1000]:
# Get the subset of that user id
tmp_df = df[df['user_id'] == id]
# Get the first index of that impression time
first_index = tmp_df['impression_time_fixed'].index[0]
# Record that join_date
tmp_datetime = pd.DatetimeIndex(tmp_df['impression_time_fixed'][first_index])
tmp_date = tmp_datetime[0].date()
join_date = tmp_date
# Populate our unique_user_daily_growth
if join_date not in unique_users_daily_growth_freq:
unique_users_daily_growth_freq[join_date] = 1
else:
unique_users_daily_growth_freq[join_date] +=1
# Sort our dict
unique_users_daily_growth_freq = dict(sorted(unique_users_daily_growth_freq.items()))
# Indices / Values for Plot
indices = [x for x in unique_users_daily_growth_freq.keys()]
values = [x for x in unique_users_daily_growth_freq.values()]
# Plot
plot_bar(indices_=indices, values_=values, yrange_=[
0, 3], xaxis_title='<b>Dates<b>', yaxis_title='<b>Count<b>', title_='<b>Daily User Growth<b>')
What is the average read time and scroll percentage across each unique user?
Solution
Read Time¶
# Read Time per User
# Group by User and Read Time
tmp_user_df = pd.DataFrame(data=df.groupby(by='user_id')[
'read_time'].mean(), columns=['read_time'])
# Plot
single_subset_feature_visualization(
df_=tmp_user_df, feature_='read_time',
data_title='Unique Users', feature_title ='Read Time(s)',
histogram_xaxis_title = 'Read Time(s)')
Scroll Percentage¶
# Scroll Percentage per User
# Group by User and Scroll Percentage
tmp_user_df = pd.DataFrame(data=df.groupby(by='user_id')[
'scroll_percentage'].mean(), columns=['scroll_percentage'])
# Plot
single_subset_feature_visualization(
df_=tmp_user_df, feature_='scroll_percentage',
data_title='Unique Users', feature_title ='Scroll Percentage(%)',
histogram_xaxis_title = 'Scroll Percentage(%)')
What does user activity look like over time?
Solution
# Record the daily, hourly, weekly, dayofweek activity across all users
# Get all unique ids in a list
unique_user_ids = df['user_id'].unique()[0:1000]
# Create dictionaries
unique_users_daily_freq = {}
unique_users_hourly_freq = {}
unique_users_dayofweek_freq = {}
unique_users_weekly_freq = {}
# Iterate through each user id
for id in unique_user_ids:
# Get the subset of that user id
tmp_df = df[df['user_id'] == id]
# Now lets go through each and populate the unique dates, hours and day of the week for each user
dates = []
hours = []
dayofweek = []
week = []
indices = np.array(tmp_df.index)
# Iterate through each index
for i in indices:
# Store the date, time, dayofweek, and week number
tmp_datetime = pd.DatetimeIndex(tmp_df['impression_time_fixed'][i])
tmp_date = tmp_datetime.date
tmp_time = tmp_datetime.time
tmp_dayofweek = tmp_datetime.weekday
tmp_week = tmp_datetime.isocalendar().week
# Append our dates, hours, dayofweek, week number
for j, k, l, m in zip(tmp_date, tmp_time, tmp_dayofweek, tmp_week):
dates.append(j)
hours.append(k)
dayofweek.append(l)
week.append(m)
# Get rid of duplicate values
unique_dates = list(set(dates))
unique_hours = list(set(hours))
unique_dayofweek = list(set(dayofweek))
unique_week = list(set(week))
# Convert to string
unique_hours = [x.hour for x in unique_hours]
unique_hours = [str(i) + ':00' if i > 9 else str(0) +
str(i) + ':00' for i in unique_hours]
# Convert the week int to mapping from 1++
unique_week = weekly_map(unique_week)
# Populate dicts
populate_dict(list_=unique_dates, dict_=unique_users_daily_freq)
populate_dict(list_=unique_hours, dict_=unique_users_hourly_freq)
populate_dict(list_=unique_dayofweek, dict_=unique_users_dayofweek_freq)
populate_dict(list_=unique_week, dict_=unique_users_weekly_freq)
# Sort our dicts
unique_users_daily_freq = dict(sorted(unique_users_daily_freq.items()))
unique_users_hourly_freq = dict(sorted(unique_users_hourly_freq.items()))
# Sort by integers for day of the week and then lets change the dict from int to str
unique_users_dayofweek_freq = dict(sorted(unique_users_dayofweek_freq.items()))
unique_users_dayofweek_freq = int_dow_dict(unique_users_dayofweek_freq)
unique_users_weekly_freq = dict(sorted(unique_users_weekly_freq.items()))
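Two small helpers are used above but defined in the notebook: weekly_map, which remaps ISO week numbers to a 1-based week index, and int_dow_dict, which renames integer weekday keys (Monday = 0) to day names. Rough sketches follow; the dataset's starting ISO week is an assumption here.
def weekly_map(weeks_):
    """
    Maps ISO week numbers to a 1-based week index within the dataset.
    """
    START_WEEK = 17  # Assumed: ISO week containing April 27, 2023; adjust if needed
    return [w - START_WEEK + 1 for w in weeks_]

def int_dow_dict(dict_):
    """
    Converts a dict keyed by weekday integers (Monday = 0) into one keyed by day names.
    """
    names = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
             'Friday', 'Saturday', 'Sunday']
    return {names[k]: v for k, v in dict_.items()}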
Daily User Activity¶
We notice a fluctuation between 675 and 850 users, with a substantial drop-off toward the end.
# Daily User Activity
## Indices / Values for Plot
indices = [x for x in unique_users_daily_freq.keys()]
values = [x for x in unique_users_daily_freq.values()]
## Plot
plot_scatter(
indices_=indices, values_=values,
yrange_=[200, 900], xaxis_title='Date',
yaxis_title='Active Users', title_='<b>Daily Active Users<b>'
)
Hourly User Activity¶
We notice a significant surge in users at 04:00 (4am), which remains relatively consistent until 21:00 (9:00 pm). Following that, there is a notable decline until 04:00.
# Hourly User Activity
## Indices / Values for Plot
indices = [x for x in unique_users_hourly_freq.keys()]
values = [x for x in unique_users_hourly_freq.values()]
## Plot
plot_scatter(
indices_ = indices , values_ = values,
yrange_ = [0, 20000], xaxis_title = 'Hour',
yaxis_title= 'Active Users', title_ = '<b>Hourly Active Users<b>'
)
Weekly User Activity¶
There is a consistent upward trend in the number of users from week 1 to week 4.
# Weekly User Activity
## Indices / Values for Plot
indices = [x for x in unique_users_weekly_freq.keys()]
values = [x for x in unique_users_weekly_freq.values()]
## Plot
plot_bar(
indices_ = indices, values_ = values,
yrange_ = [0, 3.5], xaxis_title = 'Week',
yaxis_title= 'Active Users', title_ = '<b> Weekly Active Users <b>')
Day of the Week User Activity¶
User activity remains consistent throughout all days of the week.
# Day Of The Week Activity
## Indices / Values for Plot
indices = [x for x in unique_users_dayofweek_freq.keys()]
values = [x for x in unique_users_dayofweek_freq.values()]
## Plot
plot_bar(
indices_ = indices, values_ = values,
yrange_ = [0, 3.5], xaxis_title = 'Day',
yaxis_title= 'Active Users', title_ = '<b> Day of the Week Activity <b>')
Session¶
What is the total number of sessions?
Solution
There are 36795 unique sessions.
How many unique sessions are there per day?
Solution
The session count remains relatively stable within the range of 4000 to 6000 sessions until the final date, where a notable decline is observed.
# Number of unique sessions per day
# Make a copy of the dataframe and reduce impression_time to a date
copy_df = df.copy()
copy_df['impression_time'] = copy_df['impression_time'].apply(
lambda x: x.date())
# Group by the session ids with the impression time
unique_sessions_per_day = copy_df.groupby(
by='session_id')['impression_time'].min()
tmp_dau_df = pd.DataFrame(data=unique_sessions_per_day.values,
index=unique_sessions_per_day.keys(), columns=['Session Dates'])
# Plot
multiple_subset_bar(
df_=tmp_dau_df, feature_='Session Dates',
yrange=[0, 4.5], xaxis_title = 'Session Dates')
What is the average read time and scroll percentage for each unique session?
Solution
Read Time¶
# Read Time per Session
## Group by session ids and read_time
tmp_session_df = pd.DataFrame(data=df.groupby(by='session_id')[
'read_time'].mean(), columns=['read_time'])
## Plot
single_subset_feature_visualization(
df_=tmp_session_df, feature_='read_time',
data_title='Unique Sessions', feature_title = 'Read Time(s)',
histogram_xaxis_title ='Read Time(s)')
Scroll Percentage¶
# Scroll Percentage per Session
## Group by session ids and scroll percentage
tmp_session_df = pd.DataFrame(data=df.groupby(by='session_id')[
'scroll_percentage'].mean(), columns=['scroll_percentage'])
## Plot
single_subset_feature_visualization(
df_=tmp_session_df, feature_='scroll_percentage',
data_title='Unique Sessions', feature_title = 'Scroll Percentage(%)',
histogram_xaxis_title ='Scroll Percentage(%)')
Topic¶
How many topics are there in total?
Solution
There are 78 unique topics.
What are the top 10 most popular topics?
Solution
Kendt > Sport > Begivenhed > Underholdning > Sportsbegivenhed > Kriminalitet > Livsstill > Politik > Fodbold > Erhverv
# Record the frequency of topics across unique users, readtimes across topics, and scroll percentages across those topics
# Get all unique ids in a list
unique_user_ids = df['user_id'].values[0:1000]
# Create dictionaries
unique_users_topics_freq = {}
unique_topic_scroll_freq = {}
unique_topic_read_freq = {}
# Iterate through each user id and record the topics viewed!
for id in unique_user_ids:
# Get the subset of that user id
tmp_df = df[df['user_id'] == id]
# Now lets go through each topic
indices = np.array(tmp_df.index)
for i in indices:
# Record the topic, scroll percentage and read_time for each index
tmp_topics = tmp_df['topics'][i]
tmp_scroll = tmp_df['scroll_percentage'][i]
tmp_read = tmp_df['read_time'][i]
topics = [x for x in tmp_topics]
scroll = [tmp_scroll]
read = [tmp_read]
# Find the average scroll percentage across each topic (this can hint at whether a topic relies more on visuals than on reading)
# For each topic the article belongs to, add that scroll percentage
tmp_topic_scroll = {k: v for k, v in zip(topics, scroll)}
unique_topic_scroll_freq = topics_article_id_scroll_read(
tmp_topic_scroll, unique_topic_scroll_freq)
# Find the average read time across each topic
# For each topic the article belongs to, add that read time
tmp_topic_read = {k: v for k, v in zip(topics, read)}
unique_topic_read_freq = topics_article_id_scroll_read(
tmp_topic_read, unique_topic_read_freq)
# Unique User Topics
# Get rid of duplicate values
unique_topics = list(set(topics))
# Populate our dict
populate_dict(unique_topics, unique_users_topics_freq)
# Sort the dictionaries
sorted_topic_freq = dict(
sorted(unique_users_topics_freq.items(), key=lambda x: x[1], reverse=True))
# Find the average read times across each topic
unique_topic_read_avg_freq = {k: round(np.nanmean(v), 2) for k, v in zip(
unique_topic_read_freq.keys(), unique_topic_read_freq.values())}
sorted_unique_topic_read_avg_freq = dict(
sorted(unique_topic_read_avg_freq.items(), key=lambda x: x[1], reverse=True))
# Sort the topics for distribution
sorted_unique_topic_read_freq = dict(sorted(unique_topic_read_freq.items()))
# Find the average scroll percentages across each topic
unique_topic_scroll_avg_freq = {k: round(np.nanmean(v), 2) for k, v in zip(
unique_topic_scroll_freq.keys(), unique_topic_scroll_freq.values())}
sorted_unique_topic_scroll_avg_freq = dict(
sorted(unique_topic_scroll_avg_freq.items(), key=lambda x: x[1], reverse=True))
# Sort the topics scroll pct for distribution
sorted_unique_topic_scroll_freq = dict(
sorted(unique_topic_scroll_freq.items()))
# Distribution of Topics across users!
## Indices / Values for Plot
indices = [x for x in sorted_topic_freq.keys()][0:10]
values = [x for x in sorted_topic_freq.values()][0:10]
## Plot
plot_bar(
indices_=indices, values_=values,
yrange_=[0, 3], xaxis_title='Topics',
yaxis_title='Count', title_='<b> Top 10 Highest Topic Activity<b>')
What is the distribution of read time and scroll percentage across each topic?
Solution
Read Time¶
# Box Plot of Read Time across Topics
## Indices / Values for Plot
indices = [x for x in sorted_unique_topic_read_freq.keys()]
values = [x for x in sorted_unique_topic_read_freq.values()]
## Plot
plot_box(
indices_ = indices, values_ = values,
yrange_ = [0, 3.5], xaxis_title = 'Topics',
yaxis_title= 'Read Time(s)', title_ = '<b> Distributions of Read Times across each Topic<b>')
Scroll Percentage¶
# Box Plot of Scroll Percentage across Topics
## Indices / Values for Plot
indices = [x for x in sorted_unique_topic_scroll_freq.keys()]
values = [x for x in sorted_unique_topic_scroll_freq.values()]
## Plot
plot_box(
indices_ = indices, values_ = values,
yrange_ = [0, 2.1], xaxis_title = 'Topics',
yaxis_title= 'Scroll Percentage (%)', title_ = '<b> Distributions of Scroll Percentage across each Topic<b>')
How do daily and hourly activity patterns relate to each topic?
Solution
# Daily and Hourly Activity across each Topic
# Get all the unique topics
topic_list = unique_subset_topics(df)
unique_topics = sorted(topic_list)
# Get the list of topics for each session
topics = df.groupby(by='session_id')['topics'].apply(list)
# Get the list of each unique timestamp for these sessions
timestamps = df.groupby(by='session_id')['impression_time'].apply(list)
unique_dates = []
# Create a list of hours in a str format
unique_hours = [i for i in range(24)]
unique_hours = [str(i) + ':00' if i > 9 else str(0) +
str(i) + ':00' for i in unique_hours]
# Iterate through each timestamp
for i in range(len(timestamps.values)):
# Iterate through each idx
for j in range(len(timestamps.values[i])):
# Assign datetime and date objects
tmp_datetime = timestamps.values[i][j]
tmp_date = tmp_datetime.date()
# if date not in unique dates, append
if tmp_date not in unique_dates:
unique_dates.append(tmp_date)
# Sort dates
unique_dates = sorted(unique_dates)
# Instantiate dict objects with unique dates and unique key values set to 0
unique_topic_daily_activity = {
k: {k: 0 for k in unique_dates} for k in unique_topics}
unique_topic_hourly_activity = {
k: {k: 0 for k in unique_hours} for k in unique_topics}
# Iterate through each session
for i in range(len(topics.values)):
    # Iterate through each impression (article) in the session
    for k in range(len(topics.values[i])):
        # Assign datetime, date, and hour objects for this impression
        tmp_datetime = timestamps.values[i][k]
        tmp_date = tmp_datetime.date()
        tmp_hour = tmp_datetime.time().hour
        # Convert hour into a zero-padded string
        if tmp_hour > 9:
            tmp_time = str(tmp_hour) + ':00'
        else:
            tmp_time = "0" + str(tmp_hour) + ':00'
        # Add every topic of this article to the daily and hourly dictionaries
        for tmp in topics.values[i][k]:
            unique_topic_daily_activity[tmp][tmp_date] += 1
            unique_topic_hourly_activity[tmp][tmp_time] += 1
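The unique_subset_topics helper used above (and again in the Devices, Age, and Postcodes sections) collects every distinct topic label in the dataframe. A minimal sketch of how it might look; the notebook's implementation may differ:
def unique_subset_topics(df_):
    """
    Collects every distinct topic label found in the dataframe's topics column.
    Keyword arguments:
    df_ -- pd.DataFrame
    Output:
    list of unique topic strings
    """
    unique_topics_ = set()
    # Each row holds a list/array of topics for one article
    for row in df_['topics'].dropna():
        for topic in row:
            unique_topics_.add(topic)
    return list(unique_topics_)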
Daily Activity¶
Daily activity varies significantly across topics, with some topics experiencing higher levels of activity due to their popularity.
# Daily Activity of Topics
activity_scatter(
dict_=unique_topic_daily_activity, yrange_=[0, 2100],
xaxis_title='Dates', yaxis_title='Active Users', title_='<b> Daily Active Users per Topic<b>')
Hourly Activity¶
Devices¶
What is the distribution of devices?
Solution
There are 23536 desktop users, 44472 mobile users, and 2413 tablet users.
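These counts can be sanity-checked with a simple value_counts call; the same call works for the age and postcode columns created during preprocessing.
# Count rows per device type
print(df['device_type'].value_counts())
# The same pattern applies to the other categorical columns, e.g.:
# df['age'].value_counts(), df['postcode'].value_counts()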
What is the distribution of read time and scroll percentages for devices?
Solution
Read Time¶
# Read Time across Devices
multiple_subset_feature_visualization(
df_=df, feature_1='device_type',
feature_2='read_time', feature_1_title='Devices',
feature_2_title='Read Time(s)', histogram_xaxis_title='Read Time(s)'
)
Scroll Percentage¶
What is the topic distribution for devices?
Solution
The topic distribution is relatively consistent across all devices: Kendt > Sport > Begivenhed, Underholdning > Kriminalitet
# Distribution of Topics Per Device
# Unique Topics
topic_list = unique_subset_topics(df)
unique_topics = sorted(topic_list)
# Plot
topic_feature_bar_distribution(
df_=df, feature_='device_type', yrange=[0, 4.5],
topic_list_=unique_topics, subplot_titles_=[
'<b>Desktop<b>', '<b>Mobile<b>', '<b>Tablet<b>'],
xaxis_title='<b>Topics<b>', yaxis_title='<b>Count<b>',
title_='<b>Topic Distribution Per Device<b>',
height_=750, width_=1000
)
What is the daily and hourly activity for devices?
Solution
The majority of active users are on mobile devices, outnumbering both desktop and tablet users combined. However, all activity plots exhibit similar patterns with varying magnitudes.
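As a rough way to reproduce these activity curves, we can count unique users per day split by device. This is only a sketch; the notebook uses the scatter helpers shown earlier, and swapping 'device_type' for 'age' or 'postcode' gives the equivalent plots for the later sections.
# Daily active users per device type
daily_dev = (
    df.assign(date=df['impression_time'].dt.date)
      .groupby(['date', 'device_type'])['user_id']
      .nunique()
      .unstack('device_type')
)
# Plot one line per device
fig = px.line(daily_dev, title='<b>Daily Active Users per Device<b>')
fig.show()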
Age¶
What is the distribution of ages?
Solution
There's significant variation in the frequency of ages, with a majority of users falling within the 50-79 age range.
What is the distribution of read time and scroll percentages for ages?
Solution
Read Time¶
# Read Time across Ages
multiple_subset_feature_visualization(
df_=df, feature_1='age',
feature_2='read_time', feature_1_title='Age',
feature_2_title='Read Time(s)', histogram_xaxis_title='Read Time(s)'
)
Scroll Percentage¶
What is the topic distribution for age?
Solution
Five topics consistently appear across all age ranges, but their frequency varies within each age group.
# Distribution of Topics across Ages
## Get all the unique topics
topic_list = unique_subset_topics(df)
unique_topics = sorted(topic_list)
## Plot
topic_feature_bar_distribution(
df_=df, feature_='age', yrange=[0, 2.5],
topic_list_=unique_topics,
subplot_titles_=[
'<b>20-29<b>', '<b>30-39<b>', '<b>40-49<b>',
'<b>50-59<b>', '<b>60-69<b>', '<b>70-79<b>',
'<b>80-89<b>', '<b>90-99<b>'
],
xaxis_title='<b>Topics<b>', yaxis_title='<b>Count<b>',
title_='<b>Topic Distribution of Age Groups<b>',
height_=1000, width_=1000
)
What is the daily and hourly activity for age?
Solution
The daily and hourly activity trajectories are similar across each age group, except for the 90-99 age group, which appears scattered due to missing data on certain dates.
Postcodes¶
What is the distribution of postcodes?
Solution
There are 268 Big City users, 572 Metropolitan users, 190 Municipality users, 254 Provincial users, and 352 Rural District users.
What is the distribution of read time and scroll percentage for postcodes?
Solution
Read Time¶
# Read Time across Postcodes
multiple_subset_feature_visualization(
df_=df, feature_1='postcode',
feature_2='read_time', feature_1_title='Postcodes',
feature_2_title='Read Time(s)', histogram_xaxis_title='Read Time(s)'
)
Scroll percentage¶
What is the topic distribution for postcodes?
Solution
Five topics consistently appear across all postcodes, but their frequency varies within each postcode.
# Distribution of Topics across Postcodes
## Get all the unique topics
topic_list = unique_subset_topics(df)
unique_topics = sorted(topic_list)
## Plot
topic_feature_bar_distribution(
df_=df, feature_='postcode', yrange=[0, 2.5],
topic_list_=unique_topics,
subplot_titles_=[
'<b>Big City<b>', '<b>Metropolitan<b>', '<b>Municipality<b>',
'<b>Provincial<b>', '<b>Rural District<b>'
],
xaxis_title='<b>Topics<b>', yaxis_title='<b>Count<b>',
title_='<b>Topic Distribution per Postcodes<b>',
height_=850, width_=1000
)
What is the daily and hourly activity for postcodes?
Solution
The daily and hourly activity trajectories are similar across each postcode.
For more analysis, check out my notebook containing the code!
Stay tuned for my next post, which will cover model selection for recommendation systems!