Part III.a: LLMs for Recommendation

Zero-Shot Recommendation and Metadata Generation

Introduction

This notebook explores how Large Language Models can enhance recommendation systems:

Cold-Start Problem: Recommend new items without interaction data
Zero-Shot Recommendation: Use LLM knowledge to suggest items
Metadata Generation: Enrich item catalogs with LLM-generated descriptions
Semantic Search: Embed items and queries in a shared space

Key Papers:

Geng et al. (2022) introduced P5, treating recommendation as language processing
Zhang et al. (2023) developed collaborative approaches for LLM-based recommendation
Hou et al. (2023) demonstrated LLMs as zero-shot rankers for recommender systems


import itertools
import json
import os

import numpy as np
import polars as pl
from IPython.display import Markdown, display
from plotnine import *
from sklearn.metrics.pairwise import cosine_similarity

from recsys_genai.data_utils import load_movielens
from recsys_genai.llm_utils import (
    check_ollama_available,
    ollama_embed,
    ollama_generate,
    ollama_generate_json,
    retry,
)
from recsys_genai.notebook_utils import (
    ollama_model_link,
    show_prompt,
    show_response,
    tmdb_images,
)


theme_set(
    theme_minimal()
    + theme(
        plot_title=element_text(weight="bold", size=14),
        axis_title=element_text(size=12),
        figure_size=(8, 6),
    )
)

pl.Config(
    fmt_str_lengths=50,
    tbl_rows=20,
)

LLM_MODEL = os.getenv("LLM_MODEL", "ministral-3:3b")
EMBED_MODEL = os.getenv("EMBED_MODEL", "nomic-embed-text-v2-moe")

display(
    Markdown(
        f"""
**Selected models:**

- Generative: {ollama_model_link(LLM_MODEL)}
- Embedding: {ollama_model_link(EMBED_MODEL)}
"""
    )
)

Selected models:

Generative: ministral-3:14b 🔗
Embedding: nomic-embed-text-v2-moe 🔗


movies, ratings, tags, links = load_movielens("../data")

The Cold-Start Problem

Challenge: How do we recommend a brand new movie that no one has rated yet?

Traditional collaborative filtering (CF) fails here: no interactions → no learned embeddings!
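
To make the failure concrete, consider a toy rating matrix. The sketch below uses invented numbers (not MovieLens) and a hypothetical helper `item_cosine`; the last column stands for a brand-new item with no ratings, so its similarity to every other item comes out as zero and an item-based neighbourhood model has nothing to rank it by.

```python
import numpy as np

# Toy user-item rating matrix (rows: users, columns: items); values are made up.
# The last column is a new item that nobody has rated yet.
ratings_matrix = np.array(
    [
        [5.0, 4.0, 0.0],
        [4.0, 5.0, 0.0],
        [1.0, 2.0, 0.0],
    ]
)


def item_cosine(matrix, i, j):
    """Cosine similarity between item columns i and j; 0.0 if either has no ratings."""
    a, b = matrix[:, i], matrix[:, j]
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


known_pair = item_cosine(ratings_matrix, 0, 1)
cold_pairs = [item_cosine(ratings_matrix, 2, j) for j in (0, 1)]

print(f"item 0 vs item 1: {known_pair:.3f}")
print(f"new item vs others: {cold_pairs}")
```

The two rated items are highly similar because the same users liked both, while the new item scores zero against everything.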

Example: New Movie Arrives


# Simulate a new movie with only metadata
new_movie = {
    "movie_id": 999999,
    "title": "Elemental (2023)",
    "genres": ["Animation", "Comedy", "Fantasy"],
    "description": (
        "In a city where fire, water, land and air residents live together, "
        "a fiery young woman and a go-with-the-flow guy discover something "
        "elemental: how much they actually have in common."
    ),
}

display(Markdown("**New Movie:**"))
for k, v in new_movie.items():
    display(Markdown(f"- **{k}**: {v}"))

New Movie:

movie_id: 999999

title: Elemental (2023)

genres: ['Animation', 'Comedy', 'Fantasy']

description: In a city where fire, water, land and air residents live together, a fiery young woman and a go-with-the-flow guy discover something elemental: how much they actually have in common.

Question: Who should we recommend this to?

Traditional answer: Wait for users to rate it → cold start lag!

LLM answer: Use semantic understanding of genres and description!

Zero-Shot Recommendation with LLMs

Approach 1: Prompt-Based Ranking

We can ask an LLM to rank items based on user preferences!


# Create the ranking prompt
ranking_prompt = """You are a movie recommendation expert.

User Profile:
- Loved: Toy Story, Finding Nemo, Up, Inside Out
- Preferred genres: Animation, Family, Comedy

New Movies to Consider:
- The Drama (2026) - Romance
- War Machine (2026) - Action, Sci-fi
- Ice Age: Boiling Point (2026) - Animation, Adventure, Comedy

Task: Rank these movies for this user (1 = best match).

Output format:
1. [Movie Title] - [Reason]
2. [Movie Title] - [Reason]
3. [Movie Title] - [Reason]
"""

display(show_prompt(ranking_prompt))

# Call the LLM
llm_ranking_response = ollama_generate(
    ranking_prompt,
    model=LLM_MODEL,
)
display(Markdown("**LLM Ranking Response:**"))
show_response(llm_ranking_response)

Prompt:

You are a movie recommendation expert.

User Profile:
- Loved: Toy Story, Finding Nemo, Up, Inside Out
- Preferred genres: Animation, Family, Comedy

New Movies to Consider:
- The Drama (2026) - Romance
- War Machine (2026) - Action, Sci-fi
- Ice Age: Boiling Point (2026) - Animation, Adventure, Comedy

Task: Rank these movies for this user (1 = best match).

Output format:
1. [Movie Title] - [Reason]
2. [Movie Title] - [Reason]
3. [Movie Title] - [Reason]

LLM Ranking Response:

LLM Response:

1. **Ice Age: Boiling Point (2026)** - **Animation, Adventure, Comedy**
   - *Reason*: This is the **best match** for your profile! Given your love for *Toy Story*, *Up*, and *Inside Out*—all beloved animated films with heart, humor, and adventure—this sequel fits perfectly. The *Ice Age* series has always balanced comedy, family-friendly storytelling, and dynamic animation, making it a strong contender.

2. **The Drama (2026)** - **Romance**
   - *Reason*: While this isn’t a clear fit for your preferred genres (animation, family, comedy), it could still appeal to you if you enjoy emotionally engaging stories with lighthearted or whimsical elements. Romance films like *Beauty and the Beast* (animated) or *The Princess Bride* (family-friendly adventure) might share some thematic overlap. However, since it lacks animation or comedy, it ranks lower.

3. **War Machine (2026)** - **Action, Sci-fi**
   - *Reason*: This is the **least aligned** with your profile. While it might have comedic elements or action-adventure appeal, it’s not animated or family-oriented, and its genre leans heavily toward mature themes. Unless you’re open to trying something outside your usual preferences, this wouldn’t be a top recommendation.


# Same ranking task, but ask for structured JSON output
ranking_prompt = """You are a movie recommendation expert.

User Profile:
- Loved: Toy Story, Finding Nemo, Up, Inside Out
- Preferred genres: Animation, Family, Comedy

New Movies to Consider:
- The Drama (2026) - Romance
- War Machine (2026) - Action, Sci-fi
- Ice Age: Boiling Point (2026) - Animation, Adventure, Comedy

Task: Rank these movies for this user (1 = best match).

Output format:
[
    {"rank": 1, "title": "[Movie Title]", "reason": "[Reason]"},
    {"rank": 2, "title": "[Movie Title]", "reason": "[Reason]"},
    {"rank": 3, "title": "[Movie Title]", "reason": "[Reason]"}
]
"""

display(show_prompt(ranking_prompt))

# Call the LLM
llm_ranking_response = ollama_generate_json(
    ranking_prompt,
    model=LLM_MODEL,
)
show_response(llm_ranking_response)

Prompt:

You are a movie recommendation expert.

User Profile:
- Loved: Toy Story, Finding Nemo, Up, Inside Out
- Preferred genres: Animation, Family, Comedy

New Movies to Consider:
- The Drama (2026) - Romance
- War Machine (2026) - Action, Sci-fi
- Ice Age: Boiling Point (2026) - Animation, Adventure, Comedy

Task: Rank these movies for this user (1 = best match).

Output format:
[
    {"rank": 1, "title": "[Movie Title]", "reason": "[Reason]"},
    {"rank": 2, "title": "[Movie Title]", "reason": "[Reason]"},
    {"rank": 3, "title": "[Movie Title]", "reason": "[Reason]"}
]

LLM Response:

[
  {
    "rank": 1,
    "title": "Ice Age: Boiling Point (2026)",
    "reason": "Perfect match for the user's profile! It's an **animation** film with **adventure** and **comedy** elements\u2014aligning closely with their love for *Toy Story*, *Finding Nemo*, and *Up*. The franchise\u2019s humor and heartwarming family dynamics are likely to resonate strongly."
  },
  {
    "rank": 2,
    "title": "The Drama (2026)",
    "reason": "While not a direct match for their preferred genres, the user\u2019s profile suggests they enjoy **emotional, character-driven stories** (e.g., *Up*, *Inside Out*). If *The Drama* leans into **lighthearted romance** with comedic or family-friendly undertones, it could be a fun departure. However, without confirmation of its tone, it\u2019s riskier than *Ice Age*."
  },
  {
    "rank": 3,
    "title": "War Machine (2026)",
    "reason": "Doesn\u2019t align with the user\u2019s preferences at all. **Action/Sci-fi** lacks the **animation**, **family-friendly**, or **comedy** focus they enjoy. Even if it has humor, it\u2019s unlikely to appeal given their profile."
  }
]

Key Insight: Given only titles and genre labels, the LLM ranks the candidates by drawing on its semantic knowledge of genres and themes rather than on interaction data!
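
Because the model produces this JSON freely, it can help to check the ranking before using it downstream. A minimal sketch, assuming the response has already been parsed into a list of dicts shaped like the output above; `validate_ranking` is a hypothetical helper, not part of `recsys_genai`, and the `reason` strings are placeholders.

```python
def validate_ranking(response, candidate_titles):
    """Return titles in ranked order, or raise ValueError if the ranking is malformed."""
    if not isinstance(response, list):
        raise ValueError("Expected a JSON array of ranked items")
    try:
        ordered = sorted(response, key=lambda item: item["rank"])
        titles = [item["title"] for item in ordered]
    except (KeyError, TypeError) as exc:
        raise ValueError(f"Malformed ranking entry: {exc!r}") from exc

    # Ranks must be exactly 1..N with no gaps or repeats.
    ranks = [item["rank"] for item in ordered]
    if ranks != list(range(1, len(candidate_titles) + 1)):
        raise ValueError(f"Ranks must be 1..{len(candidate_titles)}, got {ranks}")

    # Guard against invented or missing titles.
    if set(titles) != set(candidate_titles):
        raise ValueError("Ranked titles do not match the candidate list")
    return titles


candidates = ["The Drama (2026)", "War Machine (2026)", "Ice Age: Boiling Point (2026)"]
parsed = [
    {"rank": 1, "title": "Ice Age: Boiling Point (2026)", "reason": "placeholder"},
    {"rank": 2, "title": "The Drama (2026)", "reason": "placeholder"},
    {"rank": 3, "title": "War Machine (2026)", "reason": "placeholder"},
]

ranked_titles = validate_ranking(parsed, candidates)
print(ranked_titles)
```

Raising `ValueError` keeps the check compatible with a retry wrapper that catches that exception type.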

Approach 2: Embedding-Based Similarity

Modern LLMs can embed text into dense vectors. We can:

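Embed each movie's text (here, title and genres) with an embedding model
Embed a text description of the user's preferences
Compute cosine similarity between the user vector and each movie vector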
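
These steps can be sketched with toy vectors standing in for real embeddings. The numbers below are invented for illustration; in the notebook the vectors come from `ollama_embed`.

```python
import numpy as np

# Made-up 3-dimensional "embeddings"; real models return hundreds of dimensions.
user_vec = np.array([0.9, 0.1, 0.0])
item_vecs = np.array(
    [
        [1.0, 0.0, 0.0],  # item 0: points almost the same way as the user
        [0.0, 1.0, 0.0],  # item 1: nearly orthogonal to the user
        [0.5, 0.5, 0.0],  # item 2: in between
    ]
)

# Cosine similarity is the dot product of L2-normalised vectors.
user_unit = user_vec / np.linalg.norm(user_vec)
item_units = item_vecs / np.linalg.norm(item_vecs, axis=1, keepdims=True)
scores = item_units @ user_unit

# Item indices ordered from most to least similar.
ranking = np.argsort(-scores)
print(ranking.tolist())
```

Normalising first means the score depends only on the direction of each vector, not on its length.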

LLM Embeddings with Ollama


# Select favorite movies
alice_id = 1
alice_favorites = (
    ratings.filter(pl.col("user_id") == alice_id)
    .top_k(10, by="rating")
    .join(movies, on="movie_id")
    .select("movie_id", "title", "genres")
)

alice_favorites

shape: (10, 3)

movie_id	title	genres
i64	str	list[str]
356	"Forrest Gump (1994)"	["Comedy", "Drama", … "War"]
1036	"Die Hard (1988)"	["Action", "Crime", "Thriller"]
1291	"Indiana Jones and the Last Crusade (1989)"	["Action", "Adventure"]
2028	"Saving Private Ryan (1998)"	["Action", "Drama", "War"]
2762	"Sixth Sense, The (1999)"	["Drama", "Horror", "Mystery"]
3578	"Gladiator (2000)"	["Action", "Adventure", "Drama"]
4886	"Monsters, Inc. (2001)"	["Adventure", "Animation", … "Fantasy"]
4995	"Beautiful Mind, A (2001)"	["Drama", "Romance"]
7153	"Lord of the Rings: The Return of the King, The (20…	["Action", "Adventure", … "Fantasy"]
52458	"Disturbia (2007)"	["Drama", "Thriller"]


# Load posters
posters = pl.read_parquet("../data/shared/posters.parquet")

# Get poster paths for Alice's favorites (join via links to get tmdb_id)
alice_with_posters = alice_favorites.join(
    links.select(["movie_id", "tmdb_id"]), on="movie_id"
).join(posters, on="tmdb_id", how="inner")
poster_paths = alice_with_posters["poster_path"].to_list()

display(Markdown("**Alice's Favorite Movies**:"))
tmdb_images(poster_paths)

Alice’s Favorite Movies:


# Build text representations
movie_texts = movies.with_columns(
    text=pl.format("{}: {}", "title", pl.col("genres").list.join(", "))
)

# Movies Alice rated 4.5 or higher (not used for the profile below)
alice_fav_ids = ratings.filter((pl.col("user_id") == alice_id) & (pl.col("rating") >= 4.5))[
    "movie_id"
].to_list()

# User profile: concatenate the texts of Alice's ten favorites
alice_profile_texts = movie_texts.join(alice_favorites, on="movie_id", how="semi")["text"].to_list()

alice_profile = "\n".join(alice_profile_texts)
display(Markdown(f"**Alice's Profile:**\n\n{alice_profile}"))

Alice’s Profile:

Forrest Gump (1994): Comedy, Drama, Romance, War Die Hard (1988): Action, Crime, Thriller Indiana Jones and the Last Crusade (1989): Action, Adventure Saving Private Ryan (1998): Action, Drama, War Sixth Sense, The (1999): Drama, Horror, Mystery Gladiator (2000): Action, Adventure, Drama Monsters, Inc. (2001): Adventure, Animation, Children, Comedy, Fantasy Beautiful Mind, A (2001): Drama, Romance Lord of the Rings: The Return of the King, The (2003): Action, Adventure, Drama, Fantasy Disturbia (2007): Drama, Thriller


# Get list of most rated movies
# This ensures that the demo has familiar movies
most_rated_movies = ratings.group_by("movie_id").len("num_ratings").top_k(500, by="num_ratings")


# Exclude Alice's ten favorites (other movies she has rated remain candidates)
candidate_sample = movie_texts.join(most_rated_movies, on="movie_id", how="semi").join(
    alice_favorites, on="movie_id", how="anti"
)
candidate_texts = candidate_sample["text"].to_list()
candidate_ids = candidate_sample["movie_id"].to_list()

# Embed user profile
user_embedding = ollama_embed(alice_profile, model=EMBED_MODEL)

# Embed movies in batches for efficiency
batch_size = 50
movie_embeddings = []

for batch in itertools.batched(candidate_texts, batch_size):
    # batched() yields tuples; pass a list, as in the later embedding cells
    batch_embeddings = ollama_embed(list(batch), model=EMBED_MODEL)
    movie_embeddings.extend(batch_embeddings)
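
Note that `itertools.batched` was added in Python 3.12. On older interpreters a small fallback can do the same job. The sketch below is one possible version (the name `batched_compat` is hypothetical) and yields lists rather than tuples.

```python
from itertools import islice


def batched_compat(iterable, n):
    """Yield successive lists of up to n items from iterable."""
    if n < 1:
        raise ValueError("n must be at least one")
    iterator = iter(iterable)
    # islice pulls the next n items; an empty list means the iterator is exhausted.
    while batch := list(islice(iterator, n)):
        yield batch


texts = [f"movie {i}" for i in range(7)]
batch_sizes = [len(batch) for batch in batched_compat(texts, 3)]
print(batch_sizes)
```

The final batch is simply shorter when the input length is not a multiple of the batch size.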


# Convert to numpy arrays
user_vec = np.array(user_embedding).reshape(1, -1)
item_vecs = np.array(movie_embeddings)

# Compute similarities
similarities = cosine_similarity(user_vec, item_vecs)[0]

# Shapes: one user vector, one row per candidate movie, one similarity per candidate
user_vec.shape, item_vecs.shape, similarities.shape

((1, 768), (491, 768), (491,))


# Top-10 recommendations
display(Markdown("**LLM Embedding-Based Recommendations:**"))
top_movies_llm = (
    pl.DataFrame({"movie_id": candidate_ids, "similarity": similarities})
    .join(movies, on="movie_id")
    .top_k(10, by="similarity")
)

top_movies_llm

LLM Embedding-Based Recommendations:

shape: (10, 4)

movie_id	similarity	title	genres
i64	f64	str	list[str]
2890	0.614356	"Three Kings (1999)"	["Action", "Adventure", … "War"]
2617	0.602196	"Mummy, The (1999)"	["Action", "Adventure", … "Thriller"]
485	0.599206	"Last Action Hero (1993)"	["Action", "Adventure", … "Fantasy"]
91500	0.579201	"The Hunger Games (2012)"	["Action", "Adventure", … "Thriller"]
6016	0.571565	"City of God (Cidade de Deus) (2002)"	["Action", "Adventure", … "Thriller"]
3052	0.570185	"Dogma (1999)"	["Adventure", "Comedy", "Fantasy"]
1197	0.568976	"Princess Bride, The (1987)"	["Action", "Adventure", … "Romance"]
1215	0.565744	"Army of Darkness (1993)"	["Action", "Adventure", … "Horror"]
7143	0.562921	"Last Samurai, The (2003)"	["Action", "Adventure", … "War"]
2115	0.562691	"Indiana Jones and the Temple of Doom (1984)"	["Action", "Adventure", "Fantasy"]


# Display posters for recommendations (join via links to get tmdb_id)
rec_with_posters = top_movies_llm.join(links.select(["movie_id", "tmdb_id"]), on="movie_id").join(
    posters, on="tmdb_id", how="inner"
)
rec_poster_paths = rec_with_posters["poster_path"].to_list()

display(Markdown("**Recommended for Alice**"))
tmdb_images(rec_poster_paths)

Recommended for Alice

Note: These embeddings are built only from each movie's title and genre list, so the ranking reflects semantic similarity of titles and genres to Alice's favorites, not rating behavior.

Metadata Generation

LLMs can enrich sparse movie catalogs by generating:

Tags/Keywords: “heartwarming”, “visually stunning”
Mood Labels: “uplifting”, “intense”, “nostalgic”
Thematic Topics: “coming-of-age”, “family bonds”, “adventure”

Example Prompt for Metadata Generation


example_movie = movies.filter(pl.col("title").str.contains("Gump")).to_dicts()[0]
metadata_prompt = f"""\
You are a film analyst. Generate metadata for this movie:

Title: {example_movie["title"]}
Genres: {", ".join(example_movie["genres"])}

Generate:
1. Three thematic tags (e.g., friendship, adventure)
2. Mood label (one word)
3. Target audience (one phrase)
4. Similar movie archetypes

Output ONLY valid JSON in this format:
{{
  "thematic_tags": ["tag1", "tag2", "tag3"],
  "mood": "word",
  "target_audience": "phrase",
  "similar_archetypes": ["archetype1", "archetype2"]
}}
"""

display(show_prompt(metadata_prompt))

# Generate metadata using LLM
generated_metadata = ollama_generate_json(
    metadata_prompt,
    model=LLM_MODEL,
    temperature=0.3,  # Lower temperature for more structured output
)

display(show_response(generated_metadata))

Prompt:

You are a film analyst. Generate metadata for this movie:

Title: Forrest Gump (1994)
Genres: Comedy, Drama, Romance, War

Generate:
1. Three thematic tags (e.g., friendship, adventure)
2. Mood label (one word)
3. Target audience (one phrase)
4. Similar movie archetypes

Output ONLY valid JSON in this format:
{
  "thematic_tags": ["tag1", "tag2", "tag3"],
  "mood": "word",
  "target_audience": "phrase",
  "similar_archetypes": ["archetype1", "archetype2"]
}

LLM Response:

{
  "thematic_tags": [
    "destiny_and_fate",
    "innocence_and_perspective",
    "historical_footprint"
  ],
  "mood": "nostalgic",
  "target_audience": "adults_seeking_emotional_and_inspirational_storytelling",
  "similar_archetypes": [
    "the_everyman_hero",
    "the_wheelchair-bound_triumphant_underdog",
    "the_chronological_epic_with_romantic_undercurrents"
  ]
}

Use Cases:

Enhanced Search: Users can query “heartwarming family movies”
Mood-Based Recommendation: Filter by emotional tone
Richer Embeddings: Incorporate generated tags into item representations
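
For the last use case, one simple option is to fold the generated fields into the text that gets embedded. A minimal sketch, assuming metadata shaped like the JSON above; `build_item_text` is a hypothetical helper and the field order is an arbitrary choice.

```python
def build_item_text(title, genres, metadata):
    """Combine catalog fields and LLM-generated metadata into one string for embedding."""
    # Generated tags use snake_case; turn them back into plain words.
    tags = [tag.replace("_", " ") for tag in metadata.get("thematic_tags", [])]
    parts = [
        f"{title}: {', '.join(genres)}",
        f"Themes: {', '.join(tags)}" if tags else "",
        f"Mood: {metadata['mood']}" if metadata.get("mood") else "",
    ]
    # Drop empty parts so missing metadata does not leave blank lines.
    return "\n".join(part for part in parts if part)


item_text = build_item_text(
    "Forrest Gump (1994)",
    ["Comedy", "Drama", "Romance", "War"],
    {
        "thematic_tags": ["destiny_and_fate", "innocence_and_perspective", "historical_footprint"],
        "mood": "nostalgic",
    },
)
print(item_text)
```

The resulting string can be embedded in place of the plain "title: genres" text.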

Batch Metadata Generation

In production, you’d generate metadata for the entire catalog. Here, let’s use multiple descriptive dimensions to create rich movie profiles.

Define Descriptive Dimensions

We’ll generate three types of descriptions for each movie:

Mood and Atmosphere: Emotional tone and viewing experience (e.g., “uplifting”, “tense”, “melancholic”)
Target Audience: Who would enjoy this movie and why (e.g., “families with young children”, “thriller enthusiasts”)
Plot Essence: Core narrative elements in 1-2 sentences (e.g., “A toy’s journey to find its owner”)


# Sample from most-rated movies
sample_movies = movies.join(most_rated_movies, on="movie_id", how="semi").sample(n=30, seed=42)

display(
    Markdown(f"**Sampled {len(sample_movies)} movies** from most-rated for metadata generation")
)
display(Markdown("**Sample titles:**"))
sample_movies.select("title")

Sampled 30 movies from most-rated for metadata generation

Sample titles:

shape: (30, 1)

title
str
"Wedding Crashers (2005)"
"Army of Darkness (1993)"
"Mad Max: Fury Road (2015)"
"Royal Tenenbaums, The (2001)"
"Shaun of the Dead (2004)"
"RoboCop (1987)"
"Speed (1994)"
"Magnolia (1999)"
"Dr. Strangelove or: How I Learned to Stop Worrying…
"Limitless (2011)"
…
"Dances with Wolves (1990)"
"There Will Be Blood (2007)"
"O Brother, Where Art Thou? (2000)"
"Star Wars: Episode I - The Phantom Menace (1999)"
"V for Vendetta (2006)"
"True Romance (1993)"
"As Good as It Gets (1997)"
"Looper (2012)"
"Rocky Horror Picture Show, The (1975)"
"Kung Fu Panda (2008)"


# Retry up to 3 times if the response is invalid:
# small LLMs sometimes fail to follow the requested output structure


@retry(3, exceptions=(ValueError, json.JSONDecodeError))
def generate_movie_metadata(title, genres):
    """Generate multi-dimensional metadata using LLM.

    Returns:
        dict with keys: mood, target_audience, plot_essence
    """
    prompt = f"""You are a film critic. Describe this movie along three dimensions:

Title: {title}
Genres: {", ".join(genres) if genres else "Unknown"}

Generate brief descriptions (up to 2 sentences each):

1. Mood and Atmosphere: What's the emotional tone? How does it feel to watch?
2. Target Audience: Who would enjoy this and why?
3. Plot Essence: Core story in 1-2 sentences.

Output ONLY valid JSON:
{{
  "mood": "brief description",
  "target_audience": "brief description",
  "plot_essence": "brief description"
}}
"""
    metadata = ollama_generate_json(prompt, model=LLM_MODEL, temperature=0.3)
    required_keys = {"mood", "target_audience", "plot_essence"}
    if not required_keys.issubset(metadata):
        raise ValueError(f"One or more keys are missing: {required_keys - set(metadata)}")
    return metadata
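
The `retry` decorator comes from `recsys_genai.llm_utils` and its implementation is not shown here. A minimal sketch of how such a decorator could work, assuming it takes a maximum number of attempts and a tuple of exception types; the name `retry_sketch` and its behaviour are illustrative assumptions, not the package's code.

```python
import functools


def retry_sketch(max_attempts, exceptions=(Exception,)):
    """Re-run the wrapped function until it succeeds or max_attempts is reached."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error

        return wrapper

    return decorator


calls = {"count": 0}


@retry_sketch(3, exceptions=(ValueError,))
def flaky():
    """Fail twice, then succeed, to mimic a model that sometimes returns bad JSON."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ValueError("invalid response")
    return {"mood": "ok"}


result = flaky()
print(result, calls["count"])
```

Exceptions outside the listed types propagate immediately, so unrelated errors are not retried.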


# Generate metadata for all sampled movies
llm_metadata = []
print("Generating multi-dimensional metadata...")

for i, movie in enumerate(sample_movies.to_dicts()):
    print(f"  [{i + 1}/{len(sample_movies)}] {movie['title']}")
    metadata = generate_movie_metadata(movie["title"], movie["genres"])
    llm_metadata.append(metadata)

llm_metadata_df = pl.DataFrame(llm_metadata)
print("\n✅ Metadata generation complete!")

Generating multi-dimensional metadata...
  [1/30] Wedding Crashers (2005)
  [2/30] Army of Darkness (1993)
  [3/30] Mad Max: Fury Road (2015)
  [4/30] Royal Tenenbaums, The (2001)
  [5/30] Shaun of the Dead (2004)
  [6/30] RoboCop (1987)
  [7/30] Speed (1994)
  [8/30] Magnolia (1999)
  [9/30] Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)
  [10/30] Limitless (2011)
  [11/30] Ghostbusters (a.k.a. Ghost Busters) (1984)
  [12/30] Pirates of the Caribbean: At World's End (2007)
  [13/30] Training Day (2001)
  [14/30] Natural Born Killers (1994)
  [15/30] Air Force One (1997)
  [16/30] Arachnophobia (1990)
  [17/30] Vertigo (1958)
  [18/30] Dumb & Dumber (Dumb and Dumber) (1994)
  [19/30] Fight Club (1999)
  [20/30] Back to the Future Part III (1990)
  [21/30] Dances with Wolves (1990)
  [22/30] There Will Be Blood (2007)
  [23/30] O Brother, Where Art Thou? (2000)
  [24/30] Star Wars: Episode I - The Phantom Menace (1999)
  [25/30] V for Vendetta (2006)
  [26/30] True Romance (1993)
  [27/30] As Good as It Gets (1997)
  [28/30] Looper (2012)
  [29/30] Rocky Horror Picture Show, The (1975)
  [30/30] Kung Fu Panda (2008)

✅ Metadata generation complete!


show_response(llm_metadata[:3])

LLM Response:

[
  {
    "mood": "Lighthearted and witty, with a playful, fast-paced energy that balances charm and cheekiness\u2014leaving viewers grinning despite its morally dubious antics. The tone is warmly irreverent, blending romantic whimsy with comedic mischief, making it feel like a clever, carefree escape rather than a heavy-handed satire.",
    "target_audience": "Ideal for fans of sharp, dialogue-driven comedies who enjoy rom-com tropes subverted with humor (think *The Hangover* meets *How to Lose a Guy in 10 Days*), as well as viewers who appreciate raunchy but heartfelt stories with charismatic, flawed protagonists. Perfect for groups or solo watchers craving a breezy, laugh-out-loud experience.",
    "plot_essence": "Two con artists, John and Sage, exploit their charm to crash high-society weddings for free perks, but their latest scheme spirals into unexpected romantic entanglements when they befriend the groom\u2019s sister and her free-spirited friend\u2014blurring the line between scam and genuine connection."
  },
  {
    "mood": "A wild, chaotic blend of dark humor, adrenaline-fueled action, and gothic horror\u2014equal parts hilarious and unsettling, with a frenetic energy that keeps viewers laughing even as the stakes feel absurdly high. The atmosphere oscillates between campy fun and genuine scares, wrapped in a retro-futuristic, medieval-meets-modern aesthetic that feels both timeless and deliciously over-the-top.",
    "target_audience": "Fans of horror-comedy, action-packed satire, and cult classics will devour this; ideal for those who love Bruce Campbell\u2019s iconic one-liners, gory yet goofy violence, and a story that subverts expectations with sheer audacity. Also perfect for viewers who enjoy reboots-with-a-twist, as the film plays with time travel and undead legions in a way that\u2019s both clever and ridiculous.",
    "plot_essence": "A bumbling hero, Ash Williams, accidentally time-travels to a medieval world overrun by an undead army he himself unleashed centuries later, and must outwit both the dead and his past self to survive\u2014while delivering some of the most quotable lines in horror history."
  },
  {
    "mood": "A relentless, visceral frenzy of adrenaline and chaos\u2014equal parts exhilarating and oppressive, with a raw, post-apocalyptic grit that immerses viewers in a world of survival, brutality, and fleeting humanity, punctuated by breathtaking visuals and a pounding, immersive soundtrack.",
    "target_audience": "Fans of high-octane action, sci-fi spectacle, and visually stunning cinema will devour this; it also appeals to those who crave deep character arcs beneath the carnage, as well as audiences who appreciate genre-defying storytelling with emotional weight.",
    "plot_essence": "In a barren wasteland, Immortan Joe\u2019s tyrannical rule is challenged by his wife, Furiosa, who escapes with his five wives and a young warrior, Max, sparking a high-speed, blood-soaked chase across the desert as warlords, mutants, and Joe\u2019s relentless army hunt them down."
  }
]


enriched_df = pl.concat([sample_movies, llm_metadata_df], how="horizontal")


# Display first movie with full metadata
first_movie = enriched_df.to_dicts()[0]

display(
    Markdown(f"""
**Example: {first_movie["title"]}**

**Genres:** {", ".join(first_movie["genres"])}

**Mood:** {first_movie["mood"]}

**Target Audience:** {first_movie["target_audience"]}

**Plot Essence:** {first_movie["plot_essence"]}
""")
)

Example: Wedding Crashers (2005)

Genres: Comedy, Romance

Mood: Lighthearted and witty, with a playful, fast-paced energy that balances charm and cheekiness—leaving viewers grinning despite its morally dubious antics. The tone is warmly irreverent, blending romantic whimsy with comedic mischief, making it feel like a clever, carefree escape rather than a heavy-handed satire.

Target Audience: Ideal for fans of sharp, dialogue-driven comedies who enjoy rom-com tropes subverted with humor (think The Hangover meets How to Lose a Guy in 10 Days), as well as viewers who appreciate raunchy but heartfelt stories with charismatic, flawed protagonists. Perfect for groups or solo watchers craving a breezy, laugh-out-loud experience.

Plot Essence: Two con artists, John and Sage, exploit their charm to crash high-society weddings for free perks, but their latest scheme spirals into unexpected romantic entanglements when they befriend the groom’s sister and her free-spirited friend—blurring the line between scam and genuine connection.

Embed Each Dimension

Now let’s embed each metadata dimension separately. This allows us to find similar movies along each axis:


# Extract text for each dimension
mood_texts = enriched_df["mood"].to_list()
audience_texts = enriched_df["target_audience"].to_list()
plot_texts = enriched_df["plot_essence"].to_list()

print("Generating embeddings for each dimension...")
print(f"  - Mood: {len(mood_texts)} descriptions")
print(f"  - Target Audience: {len(audience_texts)} descriptions")
print(f"  - Plot Essence: {len(plot_texts)} descriptions")

Generating embeddings for each dimension...
  - Mood: 30 descriptions
  - Target Audience: 30 descriptions
  - Plot Essence: 30 descriptions


# Generate embeddings for each dimension (in batches)
print("\nEmbedding mood descriptions...")
mood_embeddings = []
for batch in itertools.batched(mood_texts, 10):
    batch_emb = ollama_embed(list(batch), model=EMBED_MODEL)
    mood_embeddings.extend(batch_emb)

print("Embedding target audience descriptions...")
audience_embeddings = []
for batch in itertools.batched(audience_texts, 10):
    batch_emb = ollama_embed(list(batch), model=EMBED_MODEL)
    audience_embeddings.extend(batch_emb)

print("Embedding plot essence descriptions...")
plot_embeddings = []
for batch in itertools.batched(plot_texts, 10):
    batch_emb = ollama_embed(list(batch), model=EMBED_MODEL)
    plot_embeddings.extend(batch_emb)

# Convert to numpy arrays (one row per movie)
mood_matrix = np.array(mood_embeddings)
audience_matrix = np.array(audience_embeddings)
plot_matrix = np.array(plot_embeddings)

print("\n✅ Embedding complete!")
print(f"  - Mood matrix shape: {mood_matrix.shape}")
print(f"  - Audience matrix shape: {audience_matrix.shape}")
print(f"  - Plot matrix shape: {plot_matrix.shape}")


Embedding mood descriptions...
Embedding target audience descriptions...
Embedding plot essence descriptions...

✅ Embedding complete!
  - Mood matrix shape: (30, 768)
  - Audience matrix shape: (30, 768)
  - Plot matrix shape: (30, 768)

Sort by Semantic Similarity

Now we can find movies similar to the first movie along each dimension using cosine similarity:


# Reference movie (first item)
reference_idx = 0
reference_movie = enriched_df.to_dicts()[reference_idx]

display(Markdown(f"**Reference Movie:** {reference_movie['title']}"))
display(Markdown(f"**Genres:** {', '.join(reference_movie['genres'])}"))

# Compute cosine similarity for each dimension
# Mood similarities
mood_ref = mood_matrix[reference_idx].reshape(1, -1)
mood_similarities = cosine_similarity(mood_ref, mood_matrix)[0]

# Audience similarities
audience_ref = audience_matrix[reference_idx].reshape(1, -1)
audience_similarities = cosine_similarity(audience_ref, audience_matrix)[0]

# Plot similarities
plot_ref = plot_matrix[reference_idx].reshape(1, -1)
plot_similarities = cosine_similarity(plot_ref, plot_matrix)[0]

Reference Movie: Wedding Crashers (2005)

Genres: Comedy, Romance


# Create dataframe with similarities for each dimension
similarity_df = enriched_df.select(
    ["movie_id", "title", "mood", "target_audience", "plot_essence"]
).with_columns(
    [
        pl.Series("mood_similarity", mood_similarities),
        pl.Series("audience_similarity", audience_similarities),
        pl.Series("plot_similarity", plot_similarities),
    ]
)

# Exclude the reference movie itself (similarity = 1.0)
other_movies_df = similarity_df.filter(pl.col("title") != reference_movie["title"])

print(f"✅ Computed similarities for {len(other_movies_df)} movies")

✅ Computed similarities for 29 movies
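
Ranking by a single dimension amounts to sorting one similarity column while skipping the reference movie. A minimal sketch with made-up scores; the `top_k_similar` helper, the titles, and the numbers are all hypothetical.

```python
import numpy as np


def top_k_similar(titles, similarities, reference_idx, k=2):
    """Return the k (title, score) pairs most similar to the reference, excluding itself."""
    scores = np.asarray(similarities, dtype=float).copy()
    scores[reference_idx] = -np.inf  # never return the reference movie itself
    order = np.argsort(-scores)[:k]
    return [(titles[i], float(scores[i])) for i in order]


titles = ["Reference", "Movie A", "Movie B", "Movie C"]
mood_scores = [1.0, 0.64, 0.51, 0.78]  # invented values; index 0 is the reference

top = top_k_similar(titles, mood_scores, reference_idx=0)
print(top)
```

Masking by index rather than by title avoids dropping a different movie that happens to share the reference title.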


other_movies_df.select(
    ["title", "mood_similarity", "audience_similarity", "plot_similarity"]
).head()

shape: (5, 4)

title	mood_similarity	audience_similarity	plot_similarity
str	f64	f64	f64
"Army of Darkness (1993)"	0.639931	0.541269	0.340079
"Mad Max: Fury Road (2015)"	0.505394	0.547591	0.419636
"Royal Tenenbaums, The (2001)"	0.636462	0.626902	0.431996
"Shaun of the Dead (2004)"	0.514145	0.65193	0.432054
"RoboCop (1987)"	0.596327	0.555125	0.347083


display(Markdown("### Most Similar by Mood\n"))

sorted_by_mood = other_movies_df.sort("mood_similarity", descending=True)
for movie in sorted_by_mood.select(["title", "mood", "mood_similarity"]).head(5).to_dicts():
    display(
        Markdown(
            f"**• {movie['title']}** (similarity: {movie['mood_similarity']:.3f})  \n  *Mood: {movie['mood']}*\n"
        )
    )

Most Similar by Mood

• Dumb & Dumber (Dumb and Dumber) (1994) (similarity: 0.780)
Mood: A lighthearted, fast-paced comedy with a warm, absurdly silly tone—equal parts slapstick and heartfelt, leaving viewers grinning through its relentless physical humor and lovably clueless protagonists. The atmosphere is infectious, blending goofy escapades with a surprisingly cozy, buddy-movie charm.

• As Good as It Gets (1997) (similarity: 0.748)
Mood: A warm yet melancholic blend of sharp wit and poignant vulnerability, balancing humor with heartfelt moments—often leaving audiences both laughing and emotionally moved by its raw, humanistic charm.

• Ghostbusters (a.k.a. Ghost Busters) (1984) (similarity: 0.671)
Mood: A high-energy blend of witty humor, playful sci-fi spectacle, and lighthearted thrills—balancing absurdity with heartfelt camaraderie, making it feel like a fun, chaotic adventure that leaves you grinning from start to finish.

• Natural Born Killers (1994) (similarity: 0.667)
Mood: A hyper-stylized, chaotic blend of dark satire and visceral intensity—equal parts adrenaline-fueled and nihilistically campy, with a feverish, hallucinatory energy that oscillates between grotesque humor and brutal violence. The tone is unrelenting, self-aware, and deliberately over-the-top, leaving viewers emotionally whiplashed yet oddly mesmerized.

• Kung Fu Panda (2008) (similarity: 0.665)
Mood: Warm, uplifting, and hilariously energetic with a perfect blend of heartfelt moments and slapstick comedy—like a cozy noodle shop filled with laughter and unexpected wisdom, leaving you grinning from ear to ear.


display(Markdown("### Most Similar by Target Audience\n"))

sorted_by_audience = other_movies_df.sort("audience_similarity", descending=True)
for movie in (
    sorted_by_audience.select(["title", "target_audience", "audience_similarity"])
    .head(5)
    .to_dicts()
):
    display(
        Markdown(
            f"**• {movie['title']}** (similarity: {movie['audience_similarity']:.3f})  \n  *Audience: {movie['target_audience']}*\n"
        )
    )

Most Similar by Target Audience

• As Good as It Gets (1997) (similarity: 0.702)
Audience: Fans of character-driven comedies with depth, particularly those who appreciate dry humor, underdog narratives, and stories about redemption and connection; ideal for viewers who enjoy films like The Royal Tenenbaums or Eternal Sunshine of the Spotless Mind in tone.

• Dumb & Dumber (Dumb and Dumber) (1994) (similarity: 0.665)
Audience: Fans of raunchy, dumb-but-loveable humor will adore this, especially those who enjoy Jim Carrey’s early manic energy and farcical antics; ideal for viewers who crave laugh-out-loud, low-stakes comedy with a side of quirky friendship dynamics.

• Shaun of the Dead (2004) (similarity: 0.652)
Audience: Fans of clever, meta-comedy (especially Monty Python and Zombieland enthusiasts) and horror-lovers who appreciate self-aware, satirical takes on classic tropes—ideal for those who enjoy witty banter, undead antics, and a protagonist as lovable as he is clueless.

• Royal Tenenbaums, The (2001) (similarity: 0.627)
Audience: Fans of Wes Anderson’s signature style (quirky, visually precise, and dialogue-driven) and those who appreciate offbeat dramas with heart, as well as viewers who enjoy layered narratives that balance humor with poignant family dynamics.

• Looper (2012) (similarity: 0.615)
Audience: Fans of thought-provoking sci-fi with visceral action (e.g., Inception, The Matrix) and viewers who appreciate morally complex protagonists; also ideal for those who enjoy dark, twisty narratives with a mix of humor and existential themes. Genre enthusiasts seeking more than just explosions—substance over spectacle.

Show code

display(Markdown("### Most Similar by Plot Essence\n"))

sorted_by_plot = other_movies_df.sort("plot_similarity", descending=True)
for movie in sorted_by_plot.select(["title", "plot_essence", "plot_similarity"]).head(5).to_dicts():
    display(
        Markdown(
            f"**• {movie['title']}** (similarity: {movie['plot_similarity']:.3f})  \n  *Plot: {movie['plot_essence']}*\n"
        )
    )

Most Similar by Plot Essence

• True Romance (1993) (similarity: 0.519)
Plot: Clare (Patricia Arquette), a naive runaway, falls for Christian (Christian Slater), a small-time criminal with a penchant for violence, as their chaotic, drug-fueled romance spirals into a deadly game of cat-and-mouse with a ruthless crime lord (Christopher Walken). Their love story becomes a blood-soaked odyssey through the underbelly of Los Angeles, where survival is the only true romance.

• Vertigo (1958) (similarity: 0.518)
Plot: A detective with acrophobia becomes obsessed with a woman who resembles his late wife, only to uncover a web of deception, identity theft, and fatal obsession in this twisted tale of love and paranoia.

• Natural Born Killers (1994) (similarity: 0.515)
Plot: A deranged couple, Mickey and Mallory, embark on a murderous crime spree across America, while their exploits are sensationalized by a tabloid journalist and a TV producer, blurring the line between reality and media exploitation in a surreal, violent critique of celebrity culture and desensitization.

• O Brother, Where Art Thou? (2000) (similarity: 0.515)
Plot: After escaping a chain-gang prison, a lovable but clueless drifter and his two eccentric companions embark on a quixotic quest to find buried treasure, tangling with corrupt politicians, a siren-like seductress, and a series of bizarre misadventures in the Great Depression-era American South.

• Fight Club (1999) (similarity: 0.509)
Plot: An insomniac office worker forms an underground fight club with a charismatic soap salesman, spiraling into a violent crusade against societal conformity—only to uncover a shadowy conspiracy that blurs the line between his alter ego and a terrorist cult.

Key Insights:

Multi-dimensional similarity: The same movie can be similar to different movies along different axes
Semantic understanding: LLM embeddings capture meaning, not just keyword matching
Flexible discovery: Users can find movies by mood, audience fit, or plot similarity

This enables:

Mood-based browsing: “Show me movies with a similar atmosphere”
Audience-targeted recommendations: “Find movies for the same demographic”
Plot-based discovery: “Movies with similar narrative structures”
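One way to serve all three browsing modes from a single ranking is to blend the per-dimension similarity scores with adjustable weights. The sketch below is an illustration rather than part of the notebook's pipeline: `blend_similarities` is a hypothetical helper and the scores are toy values, standing in for the per-dimension similarity columns computed above.

```python
import numpy as np


def blend_similarities(scores: dict[str, np.ndarray], weights: dict[str, float]) -> np.ndarray:
    """Weighted average of per-dimension similarity scores (hypothetical helper)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    blended = np.zeros_like(next(iter(scores.values())), dtype=float)
    for name, weight in weights.items():
        # Each dimension contributes in proportion to its normalized weight
        blended += (weight / total) * np.asarray(scores[name], dtype=float)
    return blended


# Toy scores for three candidate movies (illustrative values only)
toy_scores = {
    "mood": np.array([0.70, 0.50, 0.60]),
    "audience": np.array([0.40, 0.70, 0.60]),
    "plot": np.array([0.30, 0.50, 0.20]),
}

mood_first = blend_similarities(toy_scores, {"mood": 1.0, "audience": 0.0, "plot": 0.0})
audience_heavy = blend_similarities(toy_scores, {"mood": 1.0, "audience": 3.0, "plot": 0.0})

print(np.argsort(-mood_first))      # candidate order when only mood matters
print(np.argsort(-audience_heavy))  # candidate order when audience fit dominates
```

Changing the weights reorders the same candidates, which is the behavior a "browse by mood" or "browse by audience" toggle would rely on.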

Visual Comparison: Three Dimensions of Similarity

Show code

# Get top 5 movies for each dimension
mood_top5 = sorted_by_mood.head(5)
audience_top5 = sorted_by_audience.head(5)
plot_top5 = sorted_by_plot.head(5)

# Join with posters (via links to get tmdb_id)
links_tmdb = links.select(["movie_id", "tmdb_id"])
mood_posters = mood_top5.join(links_tmdb, on="movie_id").join(
    posters, on="tmdb_id", how="inner", maintain_order="left"
)
audience_posters = audience_top5.join(links_tmdb, on="movie_id").join(
    posters, on="tmdb_id", how="inner", maintain_order="left"
)
plot_posters = plot_top5.join(links_tmdb, on="movie_id").join(
    posters, on="tmdb_id", how="inner", maintain_order="left"
)

# Display each dimension
display(Markdown(f"**Reference:** {reference_movie['title']}\n"))

display(Markdown("**Most Similar by MOOD:**"))
display(tmdb_images(mood_posters["poster_path"].to_list()))

display(Markdown("***\n**Most Similar by TARGET AUDIENCE:**"))
display(tmdb_images(audience_posters["poster_path"].to_list()))

display(Markdown("***\n**Most Similar by PLOT ESSENCE:**"))
display(tmdb_images(plot_posters["poster_path"].to_list()))

Reference: Wedding Crashers (2005)

Most Similar by MOOD:

Most Similar by TARGET AUDIENCE:

Most Similar by PLOT ESSENCE:

Notice how the same reference movie yields different recommendations depending on which dimension we prioritize!
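To put a number on how much the lists differ, one option is the Jaccard overlap between two top-5 sets. This is a minimal sketch: `jaccard` is a hypothetical helper, and the titles are copied from the audience and plot outputs shown above.

```python
def jaccard(a: list[str], b: list[str]) -> float:
    """Share of titles common to both lists, relative to all titles in either list."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


audience_titles = [
    "As Good as It Gets (1997)",
    "Dumb & Dumber (Dumb and Dumber) (1994)",
    "Shaun of the Dead (2004)",
    "Royal Tenenbaums, The (2001)",
    "Looper (2012)",
]
plot_titles = [
    "True Romance (1993)",
    "Vertigo (1958)",
    "Natural Born Killers (1994)",
    "O Brother, Where Art Thou? (2000)",
    "Fight Club (1999)",
]

overlap = jaccard(audience_titles, plot_titles)
print(f"Audience vs. plot top-5 overlap: {overlap:.2f}")
```

For these two lists the overlap is zero: the audience and plot dimensions surface entirely different movies for the same reference.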

True Cold Start: Unknown Content

The Cold Start Spectrum

So far, we’ve been handling “cold for our system” items:

Movies like “Toy Story” and “The Matrix” are well known to LLMs
LLMs have extensive training data about these items
We can leverage their world knowledge

True cold start = Content completely unknown to the LLM:

Brand new videos, podcasts, articles
User-generated content (social media)
Internal company content
Truly novel items

Example: Short-Form Video

Let’s demonstrate with a short video that didn’t exist during LLM training. We have:

Transcript: What’s said in the video
Screenshots: Visual frames from the video

Show code

# Load transcript
with open("../data/shared/short_form/transcript.txt", "r", encoding="utf-8") as f:
    video_transcript = f.read()

display(Markdown(f"**Video Transcript:**\n\n```\n{video_transcript}\n```"))

Video Transcript:

🎬 Transcript: “Cappuccino vs Flat White — What Do People Actually Prefer?”

[0:00–0:04 | Hook — holding two cups]
“Most people think these two coffees are the same… but they’re really not.”

[0:04–0:08 | Quick cuts — foam + pour]
“Let me show you why.”

[0:08–0:18 | Making coffee — you at machine]
“Both start the same — espresso and milk.”
“But this one—” (steam milk longer, airy)
“—gets thick, fluffy foam.”
“And this one—” (smooth pour)
“—is all about silky, smooth milk.”

[0:18–0:22 | Hold both drinks up]
“Cappuccino… vs flat white.”

[0:22–0:40 | Café interviews — fast cuts]
You: “Quick question — what’s the difference?”

Person 1: “Uhh… size?”
Person 2: “Flat white is stronger?”
Person 3: “I have no idea.” (laughs)
Person 4 (confident): “Cappuccino has more foam.”

You (to camera):
“Okay… mixed results.”

[0:40–0:55 | Blind taste test]
“Let’s test it.”

You: “Which one is stronger?”
Person: “This one.” (points)
Another: “This one’s smoother.”

You (reveal):
“This is the flat white.”

[0:55–1:05 | Simple explanation — direct to camera]
“Here’s the trick:”
“Cappuccino has more foam — so it feels lighter.”
“Flat white has less foam — so the coffee tastes stronger.”

[1:05–1:15 | Visual demo — spoon in foam]
“Look at this.” (spoon sits on cappuccino foam)
“And this…” (spoon sinks in flat white)
“Totally different texture.”

[1:15–1:22 | Optional twist — barista or comment]
Barista (or you):
“Honestly, it also depends on who makes it.”

[1:22–1:30 | Ending — you sipping both]
“So… which one are you choosing?”

Text on screen:
“Team Cappuccino 🫧 or Team Flat White 🥛?”
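Because the transcript carries timestamped segment headers, it could be split into structured segments before prompting, for example to pass only selected scenes or to truncate by segment rather than by character count. The following is a sketch under that assumption: `split_segments` and the regular expression are hypothetical and tailored to the header format shown above, and the sample contains two segments excerpted from the transcript.

```python
import re

# Matches headers such as "[0:18–0:22 | Hold both drinks up]" (en dash or hyphen between times)
SEGMENT_HEADER = re.compile(r"^\[(\d+:\d{2})\s*[–-]\s*(\d+:\d{2})\s*\|\s*(.+?)\]\s*$")


def split_segments(transcript: str) -> list[dict]:
    """Group transcript lines under their timestamped segment header."""
    segments: list[dict] = []
    for line in transcript.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SEGMENT_HEADER.match(line)
        if match:
            start, end, label = match.groups()
            segments.append({"start": start, "end": end, "label": label, "lines": []})
        elif segments:
            # Lines before the first header (e.g. the title) are ignored
            segments[-1]["lines"].append(line)
    return segments


sample = """[0:18–0:22 | Hold both drinks up]
“Cappuccino… vs flat white.”

[0:40–0:55 | Blind taste test]
“Let’s test it.”
"""

for segment in split_segments(sample):
    print(segment["start"], segment["end"], segment["label"], len(segment["lines"]))
```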

Generate Video Description

First, ask the LLM to create a coherent description from the raw transcript:

Show code

description_prompt = f"""You are a content analyst. Below is a transcript from a short-form video.

Create a concise, engaging description of this video (2-3 sentences) that captures:
- The main topic/theme
- The format/style
- The key takeaway

Transcript:
{video_transcript}

Output ONLY the description text (no extra formatting):
"""

video_description = ollama_generate(description_prompt, model=LLM_MODEL, temperature=0.3)

display(Markdown(f"**Generated Video Description:**\n\n{video_description}"))

Generated Video Description:

“Cappuccino vs Flat White: The Coffee Showdown You Didn’t Know You Needed!”

This fast-paced, visually driven video breaks down the key differences between cappuccinos and flat whites—from foam texture to coffee strength—with a fun blind taste test and real café interviews. The takeaway? Cappuccinos are lighter and foamier, while flat whites pack a smoother, bolder punch. Which team are you on? ☕

Generate Metadata for Unknown Content

Now generate rich metadata from the description:

Show code

metadata_prompt = f"""You are a content analyst. Generate metadata for this video.

Video Description:
{video_description}

Original Transcript (for context):
{video_transcript[:500]}...

Generate metadata along the same dimensions as before:

Output ONLY valid JSON:
{{
  "mood": "brief description of emotional tone",
  "target_audience": "who would enjoy this and why",
  "plot_essence": "core narrative/content in 1-2 sentences",
  "tags": ["tag1", "tag2", "tag3", "tag4", "tag5"]
}}
"""

video_metadata = ollama_generate_json(metadata_prompt, model=LLM_MODEL, temperature=0.3)

display(Markdown("**Generated Metadata:**"))
show_response(video_metadata)

Generated Metadata:

LLM Response:

{
  "mood": "engaging, educational, and lighthearted with a competitive yet informative tone\u2014blends humor and visual contrast to make coffee culture accessible",
  "target_audience": [
    {
      "group": "coffee enthusiasts and casual drinkers",
      "reason": "seeks to clarify subtle differences between popular espresso-based drinks in an entertaining way"
    },
    {
      "group": "baristas, caf\u00e9 workers, or aspiring coffee professionals",
      "reason": "reinforces technical distinctions (foam texture, milk integration) and real-world preparation insights"
    },
    {
      "group": "social media users (TikTok/Reels/Shorts audiences)",
      "reason": "fast-paced, visually driven format with a 'showdown' hook ideal for shareable, bite-sized content"
    },
    {
      "group": "travelers or tourists in coffee-centric cities",
      "reason": "helps navigate caf\u00e9 menus by demystifying ordering choices"
    }
  ],
  "plot_essence": "A visually dynamic comparison of cappuccinos (airy foam, lighter ratio) and flat whites (smooth microfoam, bolder coffee flavor) through side-by-side preparation, blind taste tests, and caf\u00e9 interviews\u2014challenging misconceptions while celebrating the nuanced art of milk integration.",
  "tags": [
    "coffee culture",
    "barista techniques",
    "food vs food comparisons",
    "blind taste test",
    "caf\u00e9 education",
    "espresso drinks",
    "visual storytelling",
    "quick tips",
    "travel-friendly content",
    "foam texture analysis",
    "milk steaming demo",
    "coffee showdown",
    "accessible learning",
    "social media viral potential",
    "latte art adjacent",
    "beverage science"
  ]
}
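Note that the response above drifts from the requested schema: `target_audience` came back as a list of objects rather than a single string, and there are 16 tags instead of 5. A small normalization step can coerce such output before it is stored or embedded. The sketch below assumes the drift patterns seen here; `normalize_metadata` is a hypothetical helper, not part of `recsys_genai`, and the example input is abbreviated.

```python
def normalize_metadata(raw: dict, max_tags: int = 5) -> dict:
    """Coerce LLM metadata into the flat schema the prompt asked for (hypothetical helper)."""
    audience = raw.get("target_audience", "")
    if isinstance(audience, list):
        parts = []
        for entry in audience:
            if isinstance(entry, dict):
                group = str(entry.get("group", "")).strip()
                reason = str(entry.get("reason", "")).strip()
                parts.append(f"{group} ({reason})" if reason else group)
            else:
                parts.append(str(entry).strip())
        # Flatten the structured audience list into one string
        audience = "; ".join(part for part in parts if part)

    tags = raw.get("tags", [])
    if isinstance(tags, str):
        tags = [tag.strip() for tag in tags.split(",")]
    # Keep only the first max_tags non-empty tags
    tags = [str(tag) for tag in tags if str(tag).strip()][:max_tags]

    return {
        "mood": str(raw.get("mood", "")),
        "target_audience": str(audience),
        "plot_essence": str(raw.get("plot_essence", "")),
        "tags": tags,
    }


# Abbreviated example mirroring the shape of the response above
example = {
    "mood": "engaging, educational, and lighthearted",
    "target_audience": [
        {"group": "coffee enthusiasts and casual drinkers", "reason": "clarifies subtle differences"},
        {"group": "social media users", "reason": "bite-sized format"},
    ],
    "plot_essence": "A comparison of cappuccinos and flat whites.",
    "tags": ["coffee culture", "barista techniques", "blind taste test",
             "espresso drinks", "quick tips", "coffee showdown"],
}

clean = normalize_metadata(example)
print(clean["target_audience"])
print(clean["tags"])
```

Truncating to the first five tags is the simplest policy; whether the first tags are the most useful ones is an assumption worth checking on real outputs.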

Find Similar Movies

Now we can find movies similar to this completely new content:

Show code

# Embed the video description
video_embedding = ollama_embed(video_description, model=EMBED_MODEL)
video_vec = np.array(video_embedding).reshape(1, -1)

# Compare to movie plot embeddings from earlier
video_plot_similarities = cosine_similarity(video_vec, plot_matrix)[0]

# Create results dataframe
video_similarity_df = enriched_df.select(["title", "plot_essence"]).with_columns(
    [pl.Series("similarity", video_plot_similarities)]
)

display(
    Markdown(
        f"Comparing video to **{len(video_similarity_df)} movies**...\n\n### Top 5 Most Similar Movies by Content\n"
    )
)

top_similar = video_similarity_df.sort("similarity", descending=True).head(5)
for movie in top_similar.to_dicts():
    display(
        Markdown(
            f"**• {movie['title']}** (similarity: {movie['similarity']:.3f})  \n  *Plot: {movie['plot_essence']}*\n"
        )
    )

Comparing video to 30 movies…
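The ranking step above amounts to cosine similarity between one query vector and every row of an embedding matrix, followed by a top-k selection. A NumPy-only sketch of that computation follows; `top_k_cosine` is a hypothetical helper, and the 3-dimensional vectors are toy stand-ins for real embeddings, which have far more dimensions.

```python
import numpy as np


def top_k_cosine(query: np.ndarray, matrix: np.ndarray, k: int = 5) -> list[tuple[int, float]]:
    """Return (row index, cosine similarity) pairs for the k most similar rows.

    Assumes the query and every row of the matrix are non-zero vectors.
    """
    query = np.asarray(query, dtype=float)
    matrix = np.asarray(matrix, dtype=float)
    # Normalize so that the dot product equals cosine similarity
    query_unit = query / np.linalg.norm(query)
    matrix_unit = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    similarities = matrix_unit @ query_unit
    order = np.argsort(-similarities)[:k]
    return [(int(i), float(similarities[i])) for i in order]


toy_items = np.array([
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
toy_query = np.array([1.0, 0.0, 0.0])

results = top_k_cosine(toy_query, toy_items, k=2)
print(results)
```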

Key Takeaways

Zero-Shot Recommendation: LLMs can recommend without training data by leveraging semantic understanding
Cold-Start Problem Addressed: Generate metadata for completely unknown content (videos, podcasts, articles) so it can be matched without interaction data
Metadata Enrichment: LLMs generate rich descriptions across multiple dimensions (mood, audience, plot)
Multi-Dimensional Similarity: Different aspects (mood, audience, plot) yield different recommendations
Semantic Embeddings: Capture nuanced relationships beyond keyword matching

Important Papers:

P5: Treats recommendation as a sequence-to-sequence task (Geng et al., 2022)
LLMRank: Uses an LLM as a zero-shot ranker for candidate items (Hou et al., 2023)

Next: Part IV.a - Conversational Recommendation with Keyphrases!

References

Geng, S., Liu, S., Fu, Z., Ge, Y., & Zhang, Y. (2022). Recommendation as language processing (P5): A unified pretrain, personalized prompt & predict paradigm. Proceedings of the 16th ACM Conference on Recommender Systems, 299–315. https://doi.org/10.1145/3523227.3546767

Hou, Y., Zhang, J., Lin, Z., Lu, H., Xie, R., McAuley, J., & Zhao, W. X. (2023). Large language models are zero-shot rankers for recommender systems. arXiv Preprint arXiv:2305.08845. https://arxiv.org/abs/2305.08845

Zhang, Y., Hou, Y., Zhao, W. X., et al. (2023). Collaborative large language model for recommender systems. arXiv Preprint arXiv:2311.01343. https://arxiv.org/abs/2311.01343