Automatic Music Playlist Generation via Simulation-based Reinforcement Learning
Executive Summary
This paper summarizes how Spotify researchers generate playlists with reinforcement learning (RL). Unlike traditional approaches such as collaborative filtering, the RL formulation can account for acoustic coherence and listening context. The problem is modeled as a Markov Decision Process (MDP), in which the agent learns to balance recommending new tracks against familiar ones.
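To make the MDP framing concrete, here is a minimal sketch in Python. The state, action, and reward definitions below are illustrative assumptions (the paper's actual state features and policy are richer); the epsilon-greedy rule simply shows one standard way an agent can trade off exploiting familiar high-value tracks against exploring new ones.

```python
from dataclasses import dataclass
import random

# Illustrative MDP sketch (names are ours, not the paper's):
#   state  = playlist built so far + pool of candidate tracks
#   action = appending one candidate track
#   reward = simulated user feedback on that track

@dataclass
class PlaylistState:
    playlist: tuple          # tracks chosen so far, in order
    candidates: frozenset    # tracks still available to recommend

def epsilon_greedy(state, q_values, epsilon=0.1, rng=random.Random(0)):
    """Balance exploiting familiar high-value tracks vs. exploring new ones."""
    if rng.random() < epsilon:
        # Explore: pick a random candidate (a "new" track)
        return rng.choice(sorted(state.candidates))
    # Exploit: pick the candidate with the highest estimated value
    return max(state.candidates, key=lambda t: q_values.get(t, 0.0))

state = PlaylistState(playlist=(), candidates=frozenset({"a", "b", "c"}))
q = {"a": 0.2, "b": 0.9, "c": 0.1}
action = epsilon_greedy(state, q)
```

In a full agent, the Q-values themselves would be learned from simulated interactions rather than supplied by hand as they are here.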
Environment & User Modeling
The framework uses a "world model" to simulate user behavior, which allows offline training without exposing real users to suboptimal recommendations. The environment consists of a transition function (State × Action → New State) and a reward function.
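A minimal sketch of what such a simulated environment could look like, assuming a gym-style step interface and a toy user model; the class and function names here are hypothetical, not the paper's code:

```python
# Hypothetical world-model environment: the transition appends the chosen
# track to the listening history, and the reward is the learned/simulated
# probability that the user consumes that track.

class SimulatedUserEnv:
    def __init__(self, user_model, max_len):
        self.user_model = user_model  # predicts P(listen | history, track)
        self.max_len = max_len
        self.history = []

    def step(self, track):
        # Reward function: simulated user feedback for this action
        reward = self.user_model(self.history, track)
        # Transition function: State x Action -> New State
        self.history.append(track)
        done = len(self.history) >= self.max_len
        return list(self.history), reward, done

# Toy user model: the simulated user strongly prefers previously liked tracks
liked = {"t1", "t3"}
env = SimulatedUserEnv(lambda hist, t: 1.0 if t in liked else 0.2, max_len=3)
state, reward, done = env.step("t1")
```

The agent can be rolled out against this simulator for many episodes before any policy touches production traffic.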
Sequential Model (SWM)
An LSTM-based model in which track order matters. More accurate, but computationally expensive to train.
Non-Sequential (CWM)
A faster dense-layer model in which track order does not matter. Efficient for initial agent training.
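The practical difference between the two model styles can be illustrated with a toy contrast (these scoring functions are stand-ins we invented, not the paper's architectures): a sequential model's output changes when the track order changes, while a non-sequential model treats the slate as an unordered set.

```python
# Non-sequential stand-in (dense-model style): order-free aggregation.
def nonsequential_score(track_embs):
    return sum(track_embs) / len(track_embs)

# Sequential stand-in (LSTM/RNN style): a recency-weighted recurrence,
# so the score depends on the order in which tracks appear.
def sequential_score(track_embs, decay=0.5):
    score = 0.0
    for e in track_embs:
        score = decay * score + e
    return score

a, b = 1.0, 3.0
assert nonsequential_score([a, b]) == nonsequential_score([b, a])  # order-free
assert sequential_score([a, b]) != sequential_score([b, a])        # order-aware
```

This is why the sequential model can capture within-playlist coherence that the dense model misses, at the cost of more expensive training.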
Conclusion
The study demonstrated that model-based RL is feasible for playlist generation: it performs on par with current production models while allowing safer offline experimentation.