Spotify Research • 2023 • Reinforcement Learning

Automatic Music Playlist Generation via Simulation-based Reinforcement Learning

Summary by Ritom Sen

Executive Summary

This paper explores how Spotify curates playlists using Reinforcement Learning (RL). Unlike traditional approaches such as collaborative filtering, the RL formulation can account for acoustic coherence and listening context. The researchers model the problem as a Markov Decision Process (MDP), in which an agent learns to balance recommending new tracks against familiar ones.
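To make the MDP framing concrete, here is a minimal sketch of the state, transition, and reward pieces. All names (`PlaylistState`, `step`, the ±1 reward scheme) are illustrative assumptions, not the paper's actual formulation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical MDP components for playlist generation.
@dataclass
class PlaylistState:
    context: str                                       # e.g. listening context ("workout")
    history: List[int] = field(default_factory=list)   # track ids played so far

def reward(user_response: str) -> float:
    # Toy reward scheme: a full listen is positive, a skip is negative.
    return 1.0 if user_response == "listen" else -1.0

def step(state: PlaylistState, action: int,
         user_response: str) -> Tuple[PlaylistState, float]:
    # Transition function: State x Action -> New State, plus a reward.
    next_state = PlaylistState(state.context, state.history + [action])
    return next_state, reward(user_response)
```

In a real system the reward would come from the learned user model rather than a hand-written rule, but the MDP interface stays the same.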

Environment & User Modeling

The framework uses a "world model" to simulate user behavior. This allows for offline training without exposing real users to sub-optimal recommendations. The environment consists of a transition function (State × Action → New State) and a reward function.

"In simpler terms: An agent picks songs, and based on contextual info, a user model predicts the response (skip or listen), which translates to a reward used to update the agent."
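The loop in the quote above can be sketched in a few lines. Both `world_model` and the agent policy here are stand-in stubs I wrote for illustration, not Spotify's components; the stubbed user model just maps each track to a deterministic skip/listen response.

```python
import random
from typing import Callable

def world_model(track: int, context: str) -> str:
    # Stub user model: predicts the user's response to a track.
    # Seeded per track so the simulation is deterministic for illustration.
    random.seed(track)
    return "listen" if random.random() > 0.3 else "skip"

def run_episode(agent_policy: Callable[[str], int],
                context: str, n_tracks: int = 5) -> float:
    total_reward = 0.0
    for _ in range(n_tracks):
        track = agent_policy(context)           # agent picks a song
        response = world_model(track, context)  # user model predicts skip/listen
        total_reward += 1.0 if response == "listen" else -1.0
    return total_reward  # signal used to update the agent offline
```

The key point is that no real user is in the loop: the agent can explore freely against the simulator before any policy reaches production.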

Sequential Model (SWM)

Uses LSTM cells so that track order influences predictions. More accurate, but computationally expensive to train.
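A toy LSTM cell in NumPy shows why order matters in such a model: the hidden state threads through the sequence, so permuting the tracks changes the encoding. Weights, dimensions, and the `encode_tracks` helper are all illustrative, not the paper's SWM architecture.

```python
import numpy as np

def lstm_cell(x, h, c, W, U, b):
    # One LSTM step with the four gates stacked in z.
    z = W @ x + U @ h + b
    H = h.size
    i = 1 / (1 + np.exp(-z[:H]))         # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))      # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))    # output gate
    g = np.tanh(z[3*H:])                 # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

def encode_tracks(track_vecs, H=4):
    # Random fixed weights for illustration only.
    rng = np.random.default_rng(0)
    D = track_vecs[0].size
    W = rng.standard_normal((4 * H, D))
    U = rng.standard_normal((4 * H, H))
    b = np.zeros(4 * H)
    h, c = np.zeros(H), np.zeros(H)
    for x in track_vecs:                 # sequential: each step sees the previous state
        h, c = lstm_cell(x, h, c, W, U, b)
    return h                             # summary of the listening sequence
```

Because each step depends on the previous hidden state, reversing the track list produces a different encoding, which is exactly the sensitivity to order the sequential model is meant to capture.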

Non-Sequential (CWM)

A faster dense-layer model that ignores track order. Efficient for initial agent training.
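For contrast, a minimal order-invariant response model can be sketched by pooling track features before a dense layer. This is my own illustrative construction, not the paper's CWM architecture; names and dimensions are assumptions.

```python
import numpy as np

def dense_world_model(track_vecs, W, b):
    # Mean-pooling discards order, making the model permutation-invariant.
    pooled = np.mean(track_vecs, axis=0)
    logits = W @ pooled + b              # single dense layer
    p_listen = 1 / (1 + np.exp(-logits)) # sigmoid -> listen probability
    return p_listen
```

Since pooling commutes with any permutation of the inputs, the model's output is identical for any track ordering, which is what makes it cheap enough for large-scale initial training.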

Conclusion

The study successfully demonstrated that model-based RL is feasible for playlist generation, performing as well as current production models while allowing for safer offline experimentation.

Key Concept

A framework for solving sequential decision problems by building agents that learn through trial and error against an environment, guided by reward signals.