Part I.a: MovieLens Dataset Exploration

Understanding User Behavior and Rating Patterns

Introduction

Before building recommendation systems, we need to understand our data. This notebook explores the MovieLens dataset to uncover:

  1. Dataset Structure - Movies, ratings, tags, and links
  2. Rating Patterns - Distribution and variation by genre and release year
  3. Popularity Analysis - Which movies are popular? Does popularity correlate with quality?
  4. User Behavior - Activity levels, engagement patterns, cold start challenges
  5. Example Users & Movies - Define consistent examples for later notebooks

Why This Matters:

  • Understanding data characteristics helps us choose appropriate algorithms
  • User behavior patterns (cold start, engagement) motivate different recommendation approaches
  • Rating distributions affect how we interpret predictions
  • Temporal patterns are crucial for sequential models (Part II)
Show code
from pathlib import Path

import numpy as np
import pandas as pd
import polars as pl
from IPython.display import Markdown, display
from mizani.formatters import comma_format
from plotnine import *
from sklearn.decomposition import PCA

from recsys_genai.data_utils import load_movielens
Show code
theme_set(
    theme_minimal()
    + theme(
        plot_title=element_text(weight="bold", size=14),
        axis_title=element_text(size=12),
        figure_size=(8, 6),
    )
)

Load MovieLens Data

Show code
movies, ratings, tags, links = load_movielens("../data")

dataset_summary = f"""
**Dataset Summary:**

- **Movies:** {len(movies):,}
- **Ratings:** {len(ratings):,}
- **Users:** {ratings["user_id"].n_unique():,}
- **Tags:** {len(tags):,}
- **Links:** {len(links):,}
- **Density (share of user-movie pairs rated):** {len(ratings) / (ratings["user_id"].n_unique() * len(movies)) * 100:.4f}%
"""

display(Markdown(dataset_summary))

Dataset Summary:

  • Movies: 86,537
  • Ratings: 33,832,162
  • Users: 330,975
  • Tags: 2,328,315
  • Links: 86,537
  • Density (share of user-movie pairs rated): 0.1181%
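
As a quick sanity check on the last figure in the summary, the sketch below recomputes it from the counts printed above. It separates density (the share of user-movie pairs that have a rating) from sparsity (its complement). The variable names are illustrative and not part of the notebook.

```python
# Counts copied from the dataset summary above
n_ratings = 33_832_162
n_users = 330_975
n_movies = 86_537

# Density: fraction of all possible user-movie pairs that have a rating
density = n_ratings / (n_users * n_movies)
# Sparsity: fraction of the user-movie matrix that is empty
sparsity = 1.0 - density

print(f"Density:  {density:.4%}")
print(f"Sparsity: {sparsity:.4%}")
```

The density works out to roughly 0.12%, so more than 99.8% of the user-movie matrix is empty.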

Inspect Data

Movies Dataset

Show code
movies_stats = f"""
**Movies Dataset Overview:**

- Total movies: {len(movies):,}
- Columns: {", ".join(movies.columns)}
- Unique genres: {len(set([g for genres in movies["genres"].to_list() for g in genres if g != "(no genres listed)"])):,}
"""
display(Markdown(movies_stats))
display(movies.head(10))

Movies Dataset Overview:

  • Total movies: 86,537
  • Columns: movie_id, title, genres
  • Unique genres: 19
shape: (10, 3)
movie_id title genres
i64 str list[str]
1 "Toy Story (1995)" ["Adventure", "Animation", … "Fantasy"]
2 "Jumanji (1995)" ["Adventure", "Children", "Fantasy"]
3 "Grumpier Old Men (1995)" ["Comedy", "Romance"]
4 "Waiting to Exhale (1995)" ["Comedy", "Drama", "Romance"]
5 "Father of the Bride Part II (1… ["Comedy"]
6 "Heat (1995)" ["Action", "Crime", "Thriller"]
7 "Sabrina (1995)" ["Comedy", "Romance"]
8 "Tom and Huck (1995)" ["Adventure", "Children"]
9 "Sudden Death (1995)" ["Action"]
10 "GoldenEye (1995)" ["Action", "Adventure", "Thriller"]

Ratings Dataset

Show code
avg_rating = ratings["rating"].mean()
ratings_stats = f"""
**Ratings Dataset Overview:**

- Total ratings: {len(ratings):,}
- Average rating: {avg_rating:.2f}
- Rating range: {ratings["rating"].min():.1f} - {ratings["rating"].max():.1f}
- Time span: {ratings["timestamp"].min()} to {ratings["timestamp"].max()}
"""
display(Markdown(ratings_stats))
display(ratings.head(10))

Ratings Dataset Overview:

  • Total ratings: 33,832,162
  • Average rating: 3.54
  • Rating range: 0.5 - 5.0
  • Time span: 1995-01-09 11:46:44 to 2023-07-20 08:53:33
shape: (10, 4)
user_id movie_id rating timestamp
i64 i64 f64 datetime[μs]
1 1 4.0 2008-11-03 17:52:19
1 110 4.0 2008-11-05 06:04:46
1 158 4.0 2008-11-03 17:31:43
1 260 4.5 2008-11-03 18:00:04
1 356 5.0 2008-11-03 17:58:39
1 381 3.5 2008-11-03 17:41:45
1 596 4.0 2008-11-03 17:32:04
1 1036 5.0 2008-11-03 18:07:06
1 1049 3.0 2008-11-03 17:41:19
1 1066 4.0 2008-11-03 18:29:21

Tags Dataset

Show code
tags_stats = f"""
**Tags Dataset Overview:**

- Total tags: {len(tags):,}
- Unique tags: {tags["tag"].n_unique():,}
- Users who tagged: {tags["user_id"].n_unique():,}
- Movies tagged: {tags["movie_id"].n_unique():,}
"""
display(Markdown(tags_stats))
display(tags.head(10))

Tags Dataset Overview:

  • Total tags: 2,328,315
  • Unique tags: 153,951
  • Users who tagged: 25,280
  • Movies tagged: 53,452
shape: (10, 4)
user_id movie_id tag timestamp
i64 i64 str datetime[μs]
10 260 "good vs evil" 2015-05-03 15:22:38
10 260 "Harrison Ford" 2015-05-03 15:21:45
10 260 "sci-fi" 2015-05-03 15:22:18
14 1221 "Al Pacino" 2011-07-25 13:32:36
14 1221 "mafia" 2011-07-25 13:32:26
14 58559 "Atmospheric" 2011-07-24 18:00:39
14 58559 "Batman" 2011-07-24 17:59:51
14 58559 "comic book" 2011-07-24 17:59:58
14 58559 "dark" 2011-07-24 18:00:28
14 58559 "Heath Ledger" 2011-07-24 18:00:04

Movie Posters

Show code
posters = pl.read_parquet("../data/shared/posters.parquet")

# Calculate coverage by joining with links
movies_with_posters = links.select(["movie_id", "tmdb_id"]).join(posters, on="tmdb_id", how="inner")

posters_stats = f"""
**Posters Dataset Overview:**

- Total posters: {len(posters):,}
- Movies with posters: {movies_with_posters["movie_id"].n_unique():,}
- Coverage: {movies_with_posters["movie_id"].n_unique() / len(movies) * 100:.1f}% of movies have poster data
"""
display(Markdown(posters_stats))

Posters Dataset Overview:

  • Total posters: 84,676
  • Movies with posters: 84,712
  • Coverage: 97.9% of movies have poster data

Let’s display a few movie posters from popular movies:

Show code
from recsys_genai.notebook_utils import tmdb_images

# Get some popular highly-rated movies
popular_movies = (
    ratings.group_by("movie_id")
    .agg([pl.count("rating").alias("num_ratings"), pl.mean("rating").alias("avg_rating")])
    .filter(pl.col("num_ratings") >= 100)
    .filter(pl.col("avg_rating") >= 4.0)
    .sort("num_ratings", descending=True)
    .head(12)
    .join(movies, on="movie_id")
    .join(links.select(["movie_id", "tmdb_id"]), on="movie_id")
    .join(posters, on="tmdb_id", how="inner")
)

# Get poster paths for the six movies we actually display
poster_paths = popular_movies["poster_path"].to_list()[:6]

print(f"Displaying {len(poster_paths)} movie posters from popular highly-rated films:")
print()
print("  ".join([f"{row['title']}" for row in popular_movies.head(6).to_dicts()]))
print()

tmdb_images(poster_paths)
Displaying 6 movie posters from popular highly-rated films:

Silence of the Lambs, The (1991)  Star Wars: Episode V - The Empire Strikes Back (1980)  Lord of the Rings: The Fellowship of the Ring, The (2001)  Pulp Fiction (1994)  Forrest Gump (1994)  Shawshank Redemption, The (1994)

Rating Distribution

What ratings do users give?

Show code
# Create rating distribution histogram
(
    ggplot(ratings, aes(x="rating"))
    + geom_histogram(binwidth=0.5, fill="steelblue", alpha=0.7)
    + labs(title="Rating Distribution", x="Rating", y="Count")
)

Observation: Most ratings fall in the positive range (3.5-5.0), so a rating of 4.0 or higher can be treated as an implicit “thumbs up”.
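
A minimal sketch of that thresholding idea, assuming a cutoff of 4.0 and using the first ten ratings from the ratings preview shown earlier. The helper name `to_implicit` is hypothetical and not part of the notebook.

```python
def to_implicit(ratings, threshold=4.0):
    """Map explicit star ratings to 1 (liked) or 0 (not liked)."""
    return [1 if r >= threshold else 0 for r in ratings]


# First ten ratings from the ratings preview (user 1)
sample = [4.0, 4.0, 4.0, 4.5, 5.0, 3.5, 4.0, 5.0, 3.0, 4.0]
labels = to_implicit(sample)

print(labels)
print(f"Share of positives: {sum(labels) / len(labels):.0%}")
```

The threshold is a modeling choice: lowering it to 3.5 would label far more interactions as positive, given how the distribution is skewed.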

Average Rating by Release Year

How do ratings vary by movie release year?

Show code
# Extract year from movie title and calculate average rating
movies_with_year = movies.with_columns(
    [pl.col("title").str.extract(r"\((\d{4})\)", 1).cast(pl.Int32).alias("year")]
).filter(pl.col("year").is_not_null())

# Join with ratings and calculate average rating per year
ratings_by_year = (
    ratings.join(movies_with_year.select(["movie_id", "year"]), on="movie_id")
    .group_by("year")
    .agg([pl.mean("rating").alias("avg_rating"), pl.len().alias("num_ratings")])
    .filter(pl.col("num_ratings") >= 100)  # Filter years with at least 100 ratings
    .sort("year")
)

year_stats = f"""
**Average Rating by Release Year:**

- Years analyzed: {len(ratings_by_year)}
- Highest rated year: {ratings_by_year.sort("avg_rating", descending=True)["year"][0]} (avg: {ratings_by_year.sort("avg_rating", descending=True)["avg_rating"][0]:.2f})
- Lowest rated year: {ratings_by_year.sort("avg_rating")["year"][0]} (avg: {ratings_by_year.sort("avg_rating")["avg_rating"][0]:.2f})
"""
display(Markdown(year_stats))

Average Rating by Release Year:

  • Years analyzed: 130
  • Highest rated year: 1957 (avg: 4.00)
  • Lowest rated year: 1894 (avg: 2.51)
Show code
# Create bar chart
(
    ggplot(ratings_by_year, aes(x="year", y="avg_rating"))
    + geom_col(fill="steelblue", alpha=0.7)
    + geom_hline(yintercept=ratings["rating"].mean(), linetype="dashed", color="red", alpha=0.7)
    + labs(
        title="Average Rating by Movie Release Year",
        x="Release Year",
        y="Average Rating",
        caption="Red dashed line: Overall mean rating | Only years with ≥100 ratings shown",
    )
    + ylim(0, 5)
    + theme(axis_text_x=element_text(rotation=45, hjust=1))
)

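Key Insight: Older movies tend to have slightly higher average ratings (the top-rated release year is 1957, at 4.00), likely due to survivorship bias: only the best old movies remain popular enough to be rated. The trend does not hold for the very earliest films; 1894 has the lowest average (2.51).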
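
The release year in this section comes from a regular expression that picks up the first four-digit parenthetical in the title. The sketch below shows a slightly stricter variant that anchors the match at the end of the title. Whether any MovieLens title actually contains an earlier four-digit parenthetical is not established here; the second title is hypothetical, as is the helper name `extract_year`.

```python
import re

# Anchor at the end of the title so an earlier parenthetical is not mistaken for the year
YEAR_AT_END = re.compile(r"\((\d{4})\)\s*$")


def extract_year(title):
    """Return the trailing release year as an int, or None if the title has none."""
    match = YEAR_AT_END.search(title)
    return int(match.group(1)) if match else None


print(extract_year("Toy Story (1995)"))
print(extract_year("Hypothetical Title (1984) (2010)"))
print(extract_year("Title Without Year"))
```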

Average Rating by Genre

How do ratings vary across different genres?

Show code
# Extract first genre and calculate average rating
movies_with_genre = movies.with_columns(
    [pl.col("genres").list.first().alias("primary_genre")]
).filter(
    (pl.col("primary_genre").is_not_null()) & (pl.col("primary_genre") != "(no genres listed)")
)

# Join with ratings and calculate average rating per genre
ratings_by_genre = (
    ratings.join(movies_with_genre.select(["movie_id", "primary_genre"]), on="movie_id")
    .group_by("primary_genre")
    .agg([pl.mean("rating").alias("avg_rating"), pl.len().alias("num_ratings")])
    .filter(pl.col("num_ratings") >= 100)  # Filter genres with at least 100 ratings
    .sort("avg_rating", descending=True)
)

genre_stats = f"""
**Average Rating by Primary Genre:**

- Genres analyzed: {len(ratings_by_genre)}
- Highest rated genre: {ratings_by_genre["primary_genre"][0]} (avg: {ratings_by_genre["avg_rating"][0]:.2f})
- Lowest rated genre: {ratings_by_genre.sort("avg_rating")["primary_genre"][0]} (avg: {ratings_by_genre.sort("avg_rating")["avg_rating"][0]:.2f})
"""
display(Markdown(genre_stats))

Average Rating by Primary Genre:

  • Genres analyzed: 18
  • Highest rated genre: Film-Noir (avg: 4.02)
  • Lowest rated genre: Horror (avg: 3.16)
Show code
# Create horizontal bar chart
(
    ggplot(ratings_by_genre, aes(x="reorder(primary_genre, avg_rating)", y="avg_rating"))
    + geom_col(fill="steelblue", alpha=0.7)
    + geom_hline(yintercept=ratings["rating"].mean(), linetype="dashed", color="red", alpha=0.7)
    + coord_flip()
    + labs(
        title="Average Rating by Primary Genre",
        x="Genre",
        y="Average Rating",
        caption="Red dashed line: Overall mean rating | Only genres with ≥100 ratings shown",
    )
    + ylim(0, 5)
)

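Key Insight: Film-Noir, Mystery and Crime films tend to receive higher ratings, while Horror and Romance tend to be rated lower. This reflects both audience preferences and the quality distribution within genres. Note that “primary genre” here is simply the first entry in each movie’s genre list, and the lists shown above appear to be in alphabetical order, so these averages are a rough proxy rather than a full per-genre breakdown.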
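
An alternative to using only the first listed genre is to credit each rating to every genre of the movie. The sketch below illustrates that on hypothetical toy data with plain Python; it is not the notebook's method, and the names are illustrative.

```python
from collections import defaultdict

# Toy data (hypothetical): movie_id -> genre list, and (movie_id, rating) pairs
movie_genres = {1: ["Adventure", "Animation"], 2: ["Comedy", "Romance"], 3: ["Action"]}
toy_ratings = [(1, 4.0), (1, 5.0), (2, 3.0), (3, 2.0)]


def mean_rating_by_genre(movie_genres, ratings):
    """Average rating per genre, counting each rating once for every genre of its movie."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for movie_id, rating in ratings:
        for genre in movie_genres.get(movie_id, []):
            totals[genre] += rating
            counts[genre] += 1
    return {genre: totals[genre] / counts[genre] for genre in totals}


genre_means = mean_rating_by_genre(movie_genres, toy_ratings)
for genre, mean in sorted(genre_means.items(), key=lambda kv: -kv[1]):
    print(f"{genre}: {mean:.2f}")
```

With a dataframe library the same effect is usually achieved by exploding the genre list column into one row per genre before grouping.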

Popularity Bias

Which movies are most popular?

Show code
# Calculate movie popularity
top_n = 20
popularity = (
    ratings.group_by("movie_id")
    .agg(pl.len().alias("count"))
    .join(movies, on="movie_id")
    .sort("count", descending=True)
    .head(top_n)
)

# Create horizontal bar chart
(
    ggplot(popularity, aes(x="reorder(title, count)", y="count"))
    + geom_col(fill="steelblue", alpha=0.7)
    + coord_flip()
    + labs(title=f"Top {top_n} Most Popular Movies", x="Movie", y="Number of Ratings")
)

Key Insight: Ratings are heavily concentrated in a small number of blockbusters, while most movies receive very few ratings. This is the “long tail” problem.
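
One way to put a number on this concentration is the share of all ratings captured by the k most-rated movies. The sketch below uses hypothetical toy counts, and the helper name `top_k_share` is an assumption for illustration.

```python
def top_k_share(rating_counts, k):
    """Fraction of all ratings that go to the k most-rated items."""
    ordered = sorted(rating_counts, reverse=True)
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# Hypothetical per-movie rating counts
toy_counts = [100, 50, 25, 10, 5, 5, 3, 2]

print(f"Top 2 of {len(toy_counts)} movies: {top_k_share(toy_counts, 2):.0%} of ratings")
```

In this toy example two of eight movies account for three quarters of all ratings.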

Popularity Bias Analysis

Does popularity correlate with quality? Let’s investigate.

Show code
# Calculate popularity and average rating for each movie
movie_stats = (
    ratings.group_by("movie_id")
    .agg([pl.len().alias("num_ratings"), pl.mean("rating").alias("avg_rating")])
    .join(movies.select(["movie_id", "title"]), on="movie_id")
)

# Filter movies with at least 10 ratings to reduce noise
movie_stats_filtered = movie_stats.filter(pl.col("num_ratings") >= 10)

bias_stats = f"""
**Popularity Bias Statistics:**

- Movies with ≥10 ratings: {len(movie_stats_filtered):,}
- Correlation (popularity vs. avg rating): {movie_stats_filtered.select(pl.corr("num_ratings", "avg_rating"))[0, 0]:.3f}
- Most popular movie: {movie_stats.sort("num_ratings", descending=True)["title"][0]} ({movie_stats.sort("num_ratings", descending=True)["num_ratings"][0]:,} ratings)
- Highest rated (≥100 ratings): {movie_stats.filter(pl.col("num_ratings") >= 100).sort("avg_rating", descending=True)["title"][0]} (avg: {movie_stats.filter(pl.col("num_ratings") >= 100).sort("avg_rating", descending=True)["avg_rating"][0]:.2f})
"""
display(Markdown(bias_stats))

Popularity Bias Statistics:

  • Movies with ≥10 ratings: 32,021
  • Correlation (popularity vs. avg rating): 0.180
  • Most popular movie: Shawshank Redemption, The (1994) (122,296 ratings)
  • Highest rated (≥100 ratings): Planet Earth II (2016) (avg: 4.45)
Show code
# Prepare data
plot_data = movie_stats_filtered

# Create scatter plot with LOWESS smoothing
(
    ggplot(plot_data, aes(x="num_ratings", y="avg_rating"))
    + geom_point(alpha=0.15, size=0.8, color="steelblue")
    + geom_smooth(method="lowess", color="red", size=1.5, se=False, span=0.3)
    + geom_hline(
        yintercept=plot_data["avg_rating"].mean(),
        linetype="dashed",
        color="orange",
        size=1,
        alpha=0.7,
    )
    + labs(
        title="Popularity Bias: Number of Ratings vs. Average Rating",
        x="Number of Ratings (log scale)",
        y="Average Rating",
        caption="Red line: LOWESS smoothing | Orange dashed: Mean of per-movie average ratings",
    )
    + scale_x_continuous(trans="log10")
    + ylim(0, 5)
)

Key Insight: Popular movies tend to have slightly higher ratings, but the effect is modest. This suggests:

  • Selection bias: people rate movies they expect to like
  • Quality does matter: truly bad movies don’t get many ratings
  • Recommendation challenge: balancing popularity with personalization
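
The correlation reported above is computed on raw rating counts. Because counts are heavily skewed, it can be informative to also correlate the average rating with the logarithm of the count, which matches the log-scaled x-axis of the plot. The sketch below implements the Pearson coefficient by hand on hypothetical toy values; it is an illustration, not a reproduction of the 0.180 figure.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical movies: the average rating rises steadily with log10(count)
toy_counts = [10, 100, 1_000, 10_000]
toy_avgs = [3.0, 3.2, 3.4, 3.6]

raw = pearson(toy_counts, toy_avgs)
logged = pearson([math.log10(c) for c in toy_counts], toy_avgs)
print(f"raw counts: {raw:.3f}, log10 counts: {logged:.3f}")
```

On this toy data the log-scaled correlation is essentially perfect while the raw one is noticeably lower, which shows how a few very popular titles can dominate the raw statistic.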

User Activity Distribution

Show code
# Calculate user activity (ratings per user)
user_counts = ratings.group_by("user_id").agg(pl.len().alias("num_ratings"))

# Create histogram with log scale
(
    ggplot(user_counts, aes(x="num_ratings"))
    + geom_histogram(bins=50, fill="steelblue", alpha=0.7)
    + scale_x_log10()
    + labs(
        title="User Activity Distribution", x="Number of Ratings (log scale)", y="Number of Users"
    )
)

Observation: The distribution is heavy-tailed (roughly power-law): a few very active users and many casual users.
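
To relate this distribution to the cold-start problem, one can measure the share of users who fall below some minimum number of ratings. The sketch below uses hypothetical per-user counts and an arbitrary cutoff of 5; both are assumptions chosen only for illustration.

```python
def share_below(user_counts, min_ratings):
    """Fraction of users with fewer than min_ratings ratings."""
    if not user_counts:
        return 0.0
    return sum(1 for c in user_counts if c < min_ratings) / len(user_counts)


# Hypothetical ratings-per-user counts
toy_user_counts = [1, 2, 3, 20, 500]

print(f"Users with fewer than 5 ratings: {share_below(toy_user_counts, 5):.0%}")
```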

User Engagement Over Time

How long do users stay engaged with the platform?

Show code
# Calculate user engagement metrics
user_engagement = ratings.group_by("user_id").agg(
    [
        pl.len().alias("num_ratings"),
        pl.min("timestamp").alias("first_rating"),
        pl.max("timestamp").alias("last_rating"),
    ]
)

# Calculate time difference in days
user_engagement = user_engagement.with_columns(
    [
        ((pl.col("last_rating") - pl.col("first_rating")).dt.total_seconds() / (24 * 3600)).alias(
            "engagement_days"
        )
    ]
)

# Filter users with at least 2 ratings
user_engagement = user_engagement.filter(pl.col("num_ratings") > 1)

avg_days = float(user_engagement["engagement_days"].mean())
median_days = float(user_engagement["engagement_days"].median())
max_days = float(user_engagement["engagement_days"].max())

engagement_stats = f"""
**User Engagement Metrics:**

- Users analyzed: {len(user_engagement):,}
- Average engagement period: {avg_days:.1f} days
- Median engagement period: {median_days:.1f} days
- Max engagement period: {max_days:.0f} days
"""
display(Markdown(engagement_stats))

User Engagement Metrics:

  • Users analyzed: 322,397
  • Average engagement period: 166.3 days
  • Median engagement period: 0.0 days
  • Max engagement period: 9437 days
Show code
# Prepare data for plotting; drop zero-length spans, which have no log10 value
plot_data = user_engagement.filter(pl.col("engagement_days") > 0)

# Define time scale breakpoints and labels
# 1 minute, 1 hour, 1 day, 1 week, 1 month, 1 year, 10 years
time_breaks = [
    1 / 1440,  # 1 minute (1 day / 1440 minutes)
    1 / 24,  # 1 hour
    1,  # 1 day
    7,  # 1 week
    30,  # 1 month (~30 days)
    365,  # 1 year
    3650,  # 10 years
]

time_labels = ["1 min", "1 hour", "1 day", "1 week", "1 month", "1 year", "10 years"]

# Create scatter plot with LOWESS smoothing
(
    ggplot(plot_data, aes(x="num_ratings", y="engagement_days"))
    + geom_point(alpha=0.1, size=0.5, color="steelblue")
    + geom_smooth(method="lowess", color="red", size=1.5, se=False, span=0.1)
    + labs(
        title="User Engagement: Number of Ratings vs. Time Active",
        x="Number of Ratings (log scale)",
        y="Time Between First and Last Rating",
        caption="Red line: LOWESS smoothing",
    )
    + scale_x_continuous(trans="log10", labels=comma_format())
    + scale_y_continuous(trans="log10", breaks=time_breaks, labels=time_labels)
)

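Key Insight: More active users tend to stay engaged longer, but the relationship is far from linear: some users rate many movies in short bursts. The median engagement period rounds to 0.0 days, so at least half of the users with two or more ratings did all of their rating within little more than an hour.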
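
For reference, the engagement period is just the time between a user's first and last rating, expressed in days. The sketch below reproduces that calculation with the standard library, using three timestamps taken from the ratings preview shown earlier. The function name mirrors the notebook's column name but is otherwise hypothetical.

```python
from datetime import datetime


def engagement_days(timestamps):
    """Days between the earliest and latest timestamp (0.0 for a single rating)."""
    span = max(timestamps) - min(timestamps)
    return span.total_seconds() / (24 * 3600)


# Three timestamps from the ratings preview (user 1)
stamps = [
    datetime(2008, 11, 3, 17, 31, 43),
    datetime(2008, 11, 5, 6, 4, 46),
    datetime(2008, 11, 3, 18, 0, 4),
]

print(f"{engagement_days(stamps):.2f} days")
print(f"{engagement_days(stamps[:1]):.2f} days")
```

A user whose ratings all share one timestamp gets a span of exactly zero, which cannot be placed on a log-scaled axis.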

Rating Velocity

How many ratings per day do users submit during their active period?

Show code
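# Calculate rating velocity (ratings per day) for each user.
# The +1 in the denominator keeps very short spans from producing extreme values,
# so a user's velocity can never exceed their total number of ratings.
user_velocity = user_engagement.with_columns(
    [(pl.col("num_ratings") / (pl.col("engagement_days") + 1)).alias("ratings_per_day")]
).filter(pl.col("engagement_days") > 0)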

velocity_stats = f"""
**Rating Velocity:**

- Average velocity: {user_velocity["ratings_per_day"].mean():.2f} ratings/day
- Median velocity: {user_velocity["ratings_per_day"].median():.2f} ratings/day
- Users rating >1 per day: {len(user_velocity.filter(pl.col("ratings_per_day") > 1)):,} ({len(user_velocity.filter(pl.col("ratings_per_day") > 1)) / len(user_velocity) * 100:.1f}%)
"""
display(Markdown(velocity_stats))

Rating Velocity:

  • Average velocity: 34.79 ratings/day
  • Median velocity: 15.65 ratings/day
  • Users rating >1 per day: 275,853 (86.3%)
Show code
# Show top velocity users
display(Markdown("**Top 10 Most Active Users (by velocity):**"))
display(
    user_velocity.sort("ratings_per_day", descending=True)
    .head(10)
    .select(["user_id", "num_ratings", "engagement_days", "ratings_per_day"])
)

Top 10 Most Active Users (by velocity):

shape: (10, 4)
user_id num_ratings engagement_days ratings_per_day
i64 u32 f64 f64
44970 5525 0.00375 5504.358655
295202 4843 0.001042 4837.960458
199388 4109 0.000926 4105.19889
34677 3934 0.013819 3880.375368
14404 3865 0.001979 3857.365631
185341 3712 0.000532 3710.024755
50012 3420 0.000521 3418.219677
174815 3389 0.002187 3381.602744
88095 3371 0.004954 3354.383379
77647 3280 0.002292 3272.50052

Observation: Rating velocity varies dramatically - some users binge-rate, others rate occasionally. This temporal pattern is crucial for sequential models (Part II).
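
The velocity formula adds one day to the engagement period, which has a noticeable effect on the numbers above: a user who rates everything within minutes ends up with a velocity close to their total rating count. The sketch below restates the formula in plain Python and applies it to the first row of the table; the function name is hypothetical.

```python
def ratings_per_day(num_ratings, engagement_days):
    """Velocity with the +1 day smoothing used in this section."""
    return num_ratings / (engagement_days + 1)


# First row of the top-10 table: 5525 ratings over about 0.00375 days
print(f"{ratings_per_day(5525, 0.00375):.2f}")

# A steadier hypothetical user: 365 ratings over 364 days
print(f"{ratings_per_day(365, 364.0):.2f}")
```

This helps explain why the average and median velocities are so high: most users have very short engagement periods, so their velocity is approximately their number of ratings.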

Examples

Define Example Movies

Let’s select a diverse set of popular movies to use as examples throughout our analysis.

Show code
# Define example movies - a diverse set across genres and eras
EXAMPLE_MOVIE_IDS = [
    1,  # Toy Story (1995) - Animation/Children's
    2,  # Jumanji (1995) - Adventure/Fantasy
    32,  # Twelve Monkeys (1995) - Sci-Fi/Thriller
    110,  # Braveheart (1995) - Action/Drama/War
    260,  # Star Wars: Episode IV (1977) - Action/Adventure/Sci-Fi
    296,  # Pulp Fiction (1994) - Crime/Drama
    318,  # Shawshank Redemption (1994) - Crime/Drama
    356,  # Forrest Gump (1994) - Comedy/Drama/Romance
]

# Display the example movies
example_movies = movies.filter(pl.col("movie_id").is_in(EXAMPLE_MOVIE_IDS))
example_stats = f"""
**Example Movies for Analysis:**

- Total selected: {len(example_movies)}
- These movies will be highlighted in visualizations throughout the workshop
"""
display(Markdown(example_stats))
display(example_movies)

Example Movies for Analysis:

  • Total selected: 8
  • These movies will be highlighted in visualizations throughout the workshop
shape: (8, 3)
movie_id title genres
i64 str list[str]
1 "Toy Story (1995)" ["Adventure", "Animation", … "Fantasy"]
2 "Jumanji (1995)" ["Adventure", "Children", "Fantasy"]
32 "Twelve Monkeys (a.k.a. 12 Monk… ["Mystery", "Sci-Fi", "Thriller"]
110 "Braveheart (1995)" ["Action", "Drama", "War"]
260 "Star Wars: Episode IV - A New … ["Action", "Adventure", "Sci-Fi"]
296 "Pulp Fiction (1994)" ["Comedy", "Crime", … "Thriller"]
318 "Shawshank Redemption, The (199… ["Crime", "Drama"]
356 "Forrest Gump (1994)" ["Comedy", "Drama", … "War"]
Show code
# Get statistics for our example movies
example_movie_stats = (
    ratings.filter(pl.col("movie_id").is_in(EXAMPLE_MOVIE_IDS))
    .group_by("movie_id")
    .agg([pl.len().alias("rating_count"), pl.mean("rating").alias("avg_rating")])
    .join(movies, on="movie_id")
    .sort("rating_count", descending=True)
)

display(Markdown("**Statistics for Example Movies:**"))
display(example_movie_stats.select(["title", "rating_count", "avg_rating"]))

Statistics for Example Movies:

shape: (8, 3)
title rating_count avg_rating
str u32 f64
"Shawshank Redemption, The (199… 122296 4.416792
"Forrest Gump (1994)" 113581 4.068189
"Pulp Fiction (1994)" 108756 4.191778
"Star Wars: Episode IV - A New … 97202 4.0924
"Toy Story (1995)" 76813 3.893508
"Braveheart (1995)" 75514 3.996166
"Twelve Monkeys (a.k.a. 12 Monk… 59730 3.896593
"Jumanji (1995)" 30209 3.278179

Meet Alice

Our running example throughout the workshop!

Show code
alice_id = 1
alice_ratings = (
    ratings.filter(pl.col("user_id") == alice_id).join(movies, on="movie_id").sort("timestamp")
)

alice_stats = f"""
**Alice's Profile (User {alice_id}):**

- Total movies rated: {len(alice_ratings)}
- Average rating: {alice_ratings["rating"].mean():.2f}
- Favorite genres: {", ".join([g for genres in alice_ratings.sort("rating", descending=True).head(10)["genres"].to_list() for g in genres][:5])}
- Rating distribution: {dict(alice_ratings["rating"].value_counts().sort("rating").iter_rows())}
"""
display(Markdown(alice_stats))

Alice’s Profile (User 1):

  • Total movies rated: 62
  • Average rating: 4.01
  • Favorite genres: Drama, Romance, Action, Adventure, Adventure
  • Rating distribution: {2.0: 1, 2.5: 1, 3.0: 8, 3.5: 11, 4.0: 21, 4.5: 5, 5.0: 15}
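
The “Favorite genres” line lists Adventure twice because it flattens the genre lists of the top-rated movies and takes the first five entries without deduplicating. A sketch of an alternative that ranks genres by how often they occur is shown below. It uses only those rows from Alice's favorites table (later in this section) whose genre lists are displayed in full, so the counts are illustrative rather than her complete profile.

```python
from collections import Counter

# Genre lists copied from the fully displayed rows of Alice's favorites table
favorite_genres = [
    ["Action", "Crime", "Thriller"],     # Die Hard (1988)
    ["Action", "Adventure"],             # Indiana Jones and the Last Cru...
    ["Action", "Drama", "War"],          # Saving Private Ryan (1998)
    ["Drama", "Horror", "Mystery"],      # Sixth Sense, The (1999)
    ["Action", "Adventure", "Drama"],    # Gladiator (2000)
    ["Drama", "Romance"],                # Beautiful Mind, A (2001)
    ["Drama", "Thriller"],               # Disturbia (2007)
]

# Count every genre occurrence across the selected movies
genre_counts = Counter(genre for genres in favorite_genres for genre in genres)

for genre, count in genre_counts.most_common(2):
    print(f"{genre}: {count}")
```

Ranking by frequency yields each genre at most once and makes the ordering meaningful.
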
Show code
display(Markdown("**ALICE'S RATINGS (chronological, first 10):**"))
display(alice_ratings.select(["title", "rating", "genres", "timestamp"]).head(10))

ALICE’S RATINGS (chronological, first 10):

shape: (10, 4)
title rating genres timestamp
str f64 list[str] datetime[μs]
"Casper (1995)" 4.0 ["Adventure", "Children"] 2008-11-03 17:31:43
"Harry Potter and the Sorcerer'… 4.0 ["Adventure", "Children", "Fantasy"] 2008-11-03 17:31:56
"Pinocchio (1940)" 4.0 ["Animation", "Children", … "Musical"] 2008-11-03 17:32:04
"Sneakers (1992)" 3.0 ["Action", "Comedy", … "Sci-Fi"] 2008-11-03 17:32:14
"Star Trek IV: The Voyage Home … 3.0 ["Adventure", "Comedy", "Sci-Fi"] 2008-11-03 17:32:19
"X-Files: Fight the Future, The… 3.0 ["Action", "Crime", … "Thriller"] 2008-11-03 17:35:17
"Elizabeth (1998)" 3.5 ["Drama"] 2008-11-03 17:37:11
"Gandhi (1982)" 2.0 ["Drama"] 2008-11-03 17:37:22
"American Graffiti (1973)" 3.0 ["Comedy", "Drama"] 2008-11-03 17:39:15
"Last Emperor, The (1987)" 4.0 ["Drama"] 2008-11-03 17:39:25
Show code
# Alice's favorites: top 10 highest-rated movies (consistent definition across notebooks)
# Note: 15 movies are tied at 5.0, so which 10 are returned depends on how top_k breaks ties
alice_favorites = (
    ratings.filter(pl.col("user_id") == alice_id)
    .top_k(10, by="rating")
    .join(movies, on="movie_id")
    .select("title", "genres")
)

alice_top_count = len(alice_ratings.filter(pl.col("rating") >= 4.5))
display(
    Markdown(
        f"**ALICE'S FAVORITE MOVIES (top 10 by rating, from {alice_top_count} with rating >= 4.5):**"
    )
)
display(alice_favorites)

ALICE’S FAVORITE MOVIES (top 10 by rating, from 20 with rating >= 4.5):

shape: (10, 2)
title genres
str list[str]
"Forrest Gump (1994)" ["Comedy", "Drama", … "War"]
"Die Hard (1988)" ["Action", "Crime", "Thriller"]
"Indiana Jones and the Last Cru… ["Action", "Adventure"]
"Saving Private Ryan (1998)" ["Action", "Drama", "War"]
"Sixth Sense, The (1999)" ["Drama", "Horror", "Mystery"]
"Gladiator (2000)" ["Action", "Adventure", "Drama"]
"Monsters, Inc. (2001)" ["Adventure", "Animation", … "Fantasy"]
"Beautiful Mind, A (2001)" ["Drama", "Romance"]
"Lord of the Rings: The Return … ["Action", "Adventure", … "Fantasy"]
"Disturbia (2007)" ["Drama", "Thriller"]

Summary and Next Steps

We’ve explored the MovieLens dataset and uncovered several key insights:

  1. Rating Patterns: Ratings are mostly positive (3.5-5.0) and vary by genre and release year
  2. Popularity Bias: Popular movies get slightly higher ratings, but the correlation is modest
  3. User Behavior: Power-law distribution of activity, with many cold-start users
  4. Temporal Patterns: Engagement varies widely - some users binge-rate, others rate slowly
  5. Example Data: We’ve defined example movies and Alice (User 1) for use throughout the workshop

In the next notebook (01b_foundations.qmd), we’ll build recommendation systems using:

  • Matrix Factorization: Learn latent factors from user-item interactions
  • EASE: Simple yet powerful item-item collaborative filtering
  • Content vs. Collaborative: Compare different approaches

These insights will guide our choice of algorithms and help us understand model behavior!