Files
lidify/docs/vibe-matching-research-review.md
2025-12-25 18:58:06 -06:00

26 KiB
Raw Blame History

Lidify Vibe Matching System - Research Review Document

Executive Summary

This document provides a complete overview of Lidify's audio-based music recommendation ("vibe matching") system for research review. The system uses ML-based audio analysis to find similar songs based on how they sound, not metadata or collaborative filtering.


Sample Results (Live Terminal Output)

Example 1: Piano Music ("I Love You" by RIOPY)

SOURCE: "I Love You" by RIOPY
  Album: RIOPY
  Analysis Mode: enhanced
  BPM: 91.3 | Energy: 0.28 | Valence: 0.53
  Danceability: 0.96 | Arousal: 0.52 | Key: major
  ML Moods: Happy=0.91, Sad=0.65, Relaxed=1.00, Aggressive=0.99
  Mood Tags: sad, dance, chill, melancholic, relaxed, uplifting, aggressive, intense, groovy, happy

TOP MATCHES (by cosine similarity):
#   | TRACK                           | ARTIST           | BPM  | ENG  | VAL  | H    | S    | R    | A    
----|--------------------------------|------------------|------|------|------|------|------|------|------
1   | Minimal Game                    | RIOPY            | 84   | 0.25 | 0.51 | 0.70 | 0.20 | 0.80 | 0.76
2   | Lullaby                         | RIOPY            | 82   | 0.28 | 0.54 | 0.75 | 0.20 | 0.80 | 0.76
3   | Joy                             | RIOPY            | 97   | 0.34 | 0.57 | 0.98 | 0.58 | 1.00 | 0.99
4   | Introspective (From Home)       | Dirk Maassen     | 94   | 0.32 | 0.55 | 0.79 | 0.20 | 0.80 | 0.80
5   | Sweet dream                     | RIOPY            | 91   | 0.28 | 0.48 | 0.64 | 0.20 | 0.80 | 0.77
6   | Sense of hope                   | RIOPY            | 99   | 0.25 | 0.53 | 0.74 | 0.20 | 0.80 | 0.78
7   | Drive                           | RIOPY            | 96   | 0.44 | 0.55 | 0.78 | 0.20 | 0.80 | 0.78
8   | Air (From Home)                 | Dirk Maassen     | 81   | 0.14 | 0.56 | 0.79 | 0.20 | 0.80 | 0.76
9   | Prelude                         | Muse             | 85   | 0.39 | 0.40 | 0.68 | 0.70 | 0.96 | 1.00
10  | Towards the Sun                 | Dirk Maassen     | 117  | 0.25 | 0.49 | 0.66 | 0.20 | 0.80 | 0.80

Observation: Piano music correctly matches with other piano composers (RIOPY, Dirk Maassen).


Example 2: Alt-Rock ("You and I" by Pvris)

SOURCE: "You and I" by Pvris
  Album: White Noise
  Analysis Mode: enhanced
  BPM: 101.9 | Energy: 0.57 | Valence: 0.50
  Danceability: 1.00 | Arousal: 0.44 | Key: major
  ML Moods: Happy=0.49, Sad=0.31, Relaxed=0.44, Aggressive=0.68
  Mood Tags: intense, dance, aggressive, groovy

TOP MATCHES:
#   | TRACK                           | ARTIST           | BPM  | ENG  | VAL  | H    | S    | R    | A    
----|--------------------------------|------------------|------|------|------|------|------|------|------
1   | Tether                          | CHVRCHES         | 120  | 0.52 | 0.47 | 0.43 | 0.28 | 0.50 | 0.69
2   | By The Throat (Live)            | CHVRCHES         | 118  | 0.50 | 0.52 | 0.37 | 0.20 | 0.34 | 0.72
3   | Separate                        | Pvris            | 90   | 0.64 | 0.52 | 0.49 | 0.26 | 0.40 | 0.85
4   | Strong Hand (Live)              | CHVRCHES         | 80   | 0.58 | 0.60 | 0.55 | 0.34 | 0.34 | 0.74
5   | Stay Gold                       | Pvris            | 100  | 0.72 | 0.57 | 0.47 | 0.25 | 0.35 | 0.80
6   | I Like The Devil                | Purity Ring      | 100  | 0.65 | 0.54 | 0.60 | 0.31 | 0.43 | 0.92
7   | Madness (Live)                  | Muse             | 92   | 0.78 | 0.62 | 0.77 | 0.52 | 0.57 | 0.77

Observation: Synth-pop/alt-rock correctly matches with similar artists (CHVRCHES, Pvris, Purity Ring).


Example 3: Rock ("Supermassive Black Hole" by Muse)

SOURCE: "Supermassive Black Hole" by Muse
  Album: HAARP
  Analysis Mode: enhanced
  BPM: 120.1 | Energy: 0.67 | Valence: 0.56
  Danceability: 1.00 | Arousal: 0.42 | Key: minor
  ML Moods: Happy=0.72, Sad=0.64, Relaxed=0.16, Aggressive=0.22
  Mood Tags: sad, dance, melancholic, uplifting, groovy, happy

TOP MATCHES:
#   | TRACK                           | ARTIST           | BPM  | ENG  | VAL  | H    | S    | R    | A    
----|--------------------------------|------------------|------|------|------|------|------|------|------
1   | Supermassive Black Hole (Live)  | Muse             | 120  | 0.75 | 0.56 | 0.76 | 0.58 | 0.06 | 0.04
2   | Thought Contagion (Live)        | Muse             | 140  | 0.76 | 0.57 | 0.77 | 0.52 | 0.08 | 0.09
3   | Let Them In                     | Pvris            | 146  | 0.64 | 0.62 | 0.67 | 0.50 | 0.22 | 0.22
4   | Panic Station (Live)            | Muse             | 105  | 0.69 | 0.47 | 0.61 | 0.61 | 0.02 | 0.03
5   | Smoke                           | Pvris            | 150  | 0.57 | 0.56 | 0.64 | 0.66 | 0.20 | 0.30
6   | Animals                         | Muse             | 113  | 0.82 | 0.55 | 0.79 | 0.59 | 0.24 | 0.21

Observation: Rock music correctly matches with other Muse tracks and similar-sounding rock/alt artists.


System Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                         AUDIO ANALYSIS PIPELINE                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  ┌─────────────┐     ┌─────────────────────────────────────────────────┐   │
│  │ Audio File  │────►│         Essentia Audio Processing                │   │
│  │ (.flac/.mp3)│     │                                                  │   │
│  └─────────────┘     │  • FFT/Spectral Analysis                        │   │
│                      │  • Beat/Tempo Detection                          │   │
│                      │  • Key/Scale Detection                           │   │
│                      │  • RMS Energy Calculation                        │   │
│                      └─────────────┬────────────────────────────────────┘   │
│                                    │                                         │
│                      ┌─────────────▼────────────────────────────────────┐   │
│                      │      MusiCNN (TensorFlow Model)                   │   │
│                      │                                                   │   │
│                      │  Input: 16kHz mono audio                         │   │
│                      │  Output: 200-dimensional embeddings              │   │
│                      │  Architecture: Convolutional Neural Network      │   │
│                      │  Training: Million Song Dataset (MSD)            │   │
│                      └─────────────┬────────────────────────────────────┘   │
│                                    │                                         │
│           ┌────────────────────────┼────────────────────────────┐           │
│           │                        │                            │           │
│           ▼                        ▼                            ▼           │
│  ┌─────────────────┐    ┌─────────────────┐          ┌─────────────────┐   │
│  │  Mood Happy     │    │  Mood Sad       │    ...   │  Danceability   │   │
│  │  Classifier     │    │  Classifier     │          │  Classifier     │   │
│  │  (Softmax)      │    │  (Softmax)      │          │  (Softmax)      │   │
│  └────────┬────────┘    └────────┬────────┘          └────────┬────────┘   │
│           │                      │                            │             │
│           └──────────────────────┼────────────────────────────┘             │
│                                  │                                          │
│                      ┌───────────▼───────────┐                             │
│                      │   DERIVED FEATURES    │                             │
│                      │                       │                             │
│                      │  Valence = f(happy, party, sad)                     │
│                      │  Arousal = f(aggressive, party, electronic,         │
│                      │            relaxed, acoustic)                        │
│                      └───────────────────────┘                             │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────────┐
│                         VIBE MATCHING ALGORITHM                              │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                              │
│  1. Build Feature Vector (13 dimensions):                                   │
│     [moodHappy, moodSad, moodRelaxed, moodAggressive, moodParty,           │
│      moodAcoustic, moodElectronic, energy, arousal, danceability,           │
│      instrumentalness, normalizedBPM, keyMode]                              │
│                                                                              │
│  2. Compute Cosine Similarity:                                              │
│                    Σ(aᵢ × bᵢ)                                               │
│     cos(θ) = ─────────────────────                                         │
│              √(Σaᵢ²) × √(Σbᵢ²)                                             │
│                                                                              │
│  3. Add Tag/Genre Bonus (max 5%):                                           │
│     Jaccard similarity on lastfmTags  essentiaGenres                       │
│                                                                              │
│  4. Final Score = 0.95 × cosineSim + tagBonus                               │
│                                                                              │
│  5. Filter threshold: 40% (Enhanced) or 50% (Standard)                      │
│                                                                              │
└─────────────────────────────────────────────────────────────────────────────┘

Data Schema (What We Store Per Track)

Database Schema (PostgreSQL + Prisma)

-- Track table audio analysis columns
model Track {
  -- Basic Info
  id            String   @id
  title         String
  albumId       String
  duration      Int      -- seconds
  filePath      String   -- relative path to audio file
  
  -- === RHYTHM ANALYSIS (Essentia) ===
  bpm           Float?   -- beats per minute (60-200 typical)
  beatsCount    Int?     -- total beats in track
  
  -- === TONALITY (Essentia) ===
  key           String?  -- musical key ("C", "F#", "Bb", etc.)
  keyScale      String?  -- "major" or "minor"
  keyStrength   Float?   -- confidence 0-1
  
  -- === ENERGY & DYNAMICS (Essentia) ===
  energy        Float?   -- overall energy 0-1 (RMS-based)
  loudness      Float?   -- average loudness in dB
  dynamicRange  Float?   -- dynamic range in dB
  
  -- === BASIC AUDIO FEATURES ===
  danceability      Float?   -- 0-1 how suitable for dancing
  valence           Float?   -- 0 (sad) to 1 (happy) - DERIVED
  arousal           Float?   -- 0 (calm) to 1 (energetic) - DERIVED
  
  -- === INSTRUMENTATION ===
  instrumentalness  Float?   -- 0-1 (1 = no vocals) - ML predicted
  acousticness      Float?   -- 0-1 (1 = acoustic)
  speechiness       Float?   -- 0-1 (1 = spoken word)
  
  -- === ML MOOD PREDICTIONS (Enhanced Mode) ===
  -- These are the core ML outputs from MusiCNN classifiers
  moodHappy         Float?   -- ML prediction 0-1 (probability of happy)
  moodSad           Float?   -- ML prediction 0-1 (probability of sad)
  moodRelaxed       Float?   -- ML prediction 0-1 (probability of relaxed)
  moodAggressive    Float?   -- ML prediction 0-1 (probability of aggressive)
  moodParty         Float?   -- ML prediction 0-1 (probability of party/upbeat)
  moodAcoustic      Float?   -- ML prediction 0-1 (probability of acoustic)
  moodElectronic    Float?   -- ML prediction 0-1 (probability of electronic)
  danceabilityMl    Float?   -- ML-based danceability (more accurate)
  
  -- === DERIVED TAGS ===
  moodTags          String[] -- ["aggressive", "happy", "chill", "workout"]
  essentiaGenres    String[] -- ["rock", "electronic", "jazz"]
  lastfmTags        String[] -- ["chill", "workout", "sad", "90s"]
  
  -- === ANALYSIS METADATA ===
  analysisStatus    String   -- pending, processing, completed, failed
  analysisMode      String?  -- 'standard' or 'enhanced'
  analysisVersion   String?  -- Essentia version used
  analyzedAt        DateTime?
}

Core Algorithm: Feature Extraction (Python)

analyzer.py - ML Feature Extraction

def _extract_ml_features(self, audio_16k) -> Dict[str, Any]:
    """
    Extract features using Essentia MusiCNN + classification heads.
    
    Architecture:
    1. TensorflowPredictMusiCNN extracts embeddings from audio
    2. TensorflowPredict2D classification heads output predictions
    """
    result = {}
    
    # Step 1: Get embeddings from base MusiCNN model
    # Output shape: [frames, 200] - 200-dimensional embedding per frame
    embeddings = self.musicnn_model(audio_16k)
    
    # Step 2: Pass embeddings through classification heads
    # Each head outputs [frames, 2] where [:, 1] is probability of positive class
    
    # Collect raw predictions
    if 'mood_happy' in self.prediction_models:
        preds = self.prediction_models['mood_happy'](embeddings)
        result['moodHappy'] = float(np.mean(preds[:, 1]))
    
    if 'mood_sad' in self.prediction_models:
        preds = self.prediction_models['mood_sad'](embeddings)
        result['moodSad'] = float(np.mean(preds[:, 1]))
    
    if 'mood_relaxed' in self.prediction_models:
        preds = self.prediction_models['mood_relaxed'](embeddings)
        result['moodRelaxed'] = float(np.mean(preds[:, 1]))
    
    if 'mood_aggressive' in self.prediction_models:
        preds = self.prediction_models['mood_aggressive'](embeddings)
        result['moodAggressive'] = float(np.mean(preds[:, 1]))
    
    if 'mood_party' in self.prediction_models:
        preds = self.prediction_models['mood_party'](embeddings)
        result['moodParty'] = float(np.mean(preds[:, 1]))
    
    if 'mood_acoustic' in self.prediction_models:
        preds = self.prediction_models['mood_acoustic'](embeddings)
        result['moodAcoustic'] = float(np.mean(preds[:, 1]))
    
    if 'mood_electronic' in self.prediction_models:
        preds = self.prediction_models['mood_electronic'](embeddings)
        result['moodElectronic'] = float(np.mean(preds[:, 1]))
    
    # === VALENCE (derived from mood models) ===
    # Valence = emotional positivity: happy/party vs sad
    happy = result.get('moodHappy', 0.5)
    sad = result.get('moodSad', 0.5)
    party = result.get('moodParty', 0.5)
    result['valence'] = round(happy * 0.5 + party * 0.3 + (1 - sad) * 0.2, 3)
    
    # === AROUSAL (derived from mood models) ===
    # Arousal = energy level: aggressive/party/electronic vs relaxed/acoustic
    aggressive = result.get('moodAggressive', 0.5)
    relaxed = result.get('moodRelaxed', 0.5)
    acoustic = result.get('moodAcoustic', 0.5)
    electronic = result.get('moodElectronic', 0.5)
    result['arousal'] = round(
        aggressive * 0.35 + 
        party * 0.25 + 
        electronic * 0.2 + 
        (1 - relaxed) * 0.1 + 
        (1 - acoustic) * 0.1, 
        3
    )
    
    return result

Core Algorithm: Cosine Similarity Matching (TypeScript)

library.ts - Vibe Matching Implementation

// === COSINE SIMILARITY SCORING ===
// Industry-standard approach: build feature vectors, compute cosine similarity
// Uses ALL 13 features for comprehensive matching

// Helper: Build normalized feature vector from track
const buildFeatureVector = (track: TrackFeatures): number[] => {
    return [
        // ML Mood predictions (7 features) - 0.5 default for missing
        track.moodHappy ?? 0.5,
        track.moodSad ?? 0.5,
        track.moodRelaxed ?? 0.5,
        track.moodAggressive ?? 0.5,
        track.moodParty ?? 0.5,
        track.moodAcoustic ?? 0.5,
        track.moodElectronic ?? 0.5,
        // Audio features (5 features)
        track.energy ?? 0.5,
        track.arousal ?? 0.5,
        track.danceabilityMl ?? track.danceability ?? 0.5,
        track.instrumentalness ?? 0.5,
        // BPM normalized to 0-1 (60-180 BPM range)
        Math.max(0, Math.min(1, ((track.bpm ?? 120) - 60) / 120)),
        // Key: major=1, minor=0, unknown=0.5
        track.keyScale === 'major' ? 1 : track.keyScale === 'minor' ? 0 : 0.5,
    ];
};

// Helper: Compute cosine similarity between two vectors
const cosineSimilarity = (a: number[], b: number[]): number => {
    let dot = 0, magA = 0, magB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    if (magA === 0 || magB === 0) return 0;
    return dot / (Math.sqrt(magA) * Math.sqrt(magB));
};

// Helper: Compute tag overlap bonus
const computeTagBonus = (
    sourceTags: string[], 
    sourceGenres: string[],
    trackTags: string[],
    trackGenres: string[]
): number => {
    const sourceSet = new Set([...sourceTags, ...sourceGenres].map(t => t.toLowerCase()));
    const trackSet = new Set([...trackTags, ...trackGenres].map(t => t.toLowerCase()));
    if (sourceSet.size === 0 || trackSet.size === 0) return 0;
    const overlap = [...sourceSet].filter(tag => trackSet.has(tag)).length;
    // Max 5% bonus for tag overlap
    return Math.min(0.05, overlap * 0.01);
};

// Score all candidate tracks
const scored = analyzedTracks.map(t => {
    const targetVector = buildFeatureVector(t);
    
    // Compute base cosine similarity
    let score = cosineSimilarity(sourceVector, targetVector);
    
    // Add tag/genre overlap bonus (max 5%)
    const tagBonus = computeTagBonus(
        sourceTrack.lastfmTags || [],
        sourceTrack.essentiaGenres || [],
        t.lastfmTags || [],
        t.essentiaGenres || []
    );
    
    // Final score: 95% cosine similarity + 5% tag bonus
    const finalScore = score * 0.95 + tagBonus;
    
    return { id: t.id, score: finalScore };
});

// Filter to good matches (>40% for Enhanced, >50% for Standard)
const minThreshold = isEnhancedAnalysis ? 0.40 : 0.50;
const goodMatches = scored
    .filter(t => t.score > minThreshold)
    .sort((a, b) => b.score - a.score);

Feature Vector Breakdown

Index Feature Range Description Weight Rationale
0 moodHappy 0-1 ML probability of happy mood Core mood dimension
1 moodSad 0-1 ML probability of sad mood Core mood dimension
2 moodRelaxed 0-1 ML probability of relaxed mood Core mood dimension
3 moodAggressive 0-1 ML probability of aggressive mood Core mood dimension
4 moodParty 0-1 ML probability of party/upbeat Core mood dimension
5 moodAcoustic 0-1 ML probability of acoustic sound Instrumentation
6 moodElectronic 0-1 ML probability of electronic sound Instrumentation
7 energy 0-1 RMS-based energy level Audio characteristic
8 arousal 0-1 Derived energy/intensity Composite dimension
9 danceability 0-1 ML or Essentia danceability Rhythm characteristic
10 instrumentalness 0-1 Voice/instrumental ML detection Instrumentation
11 normalizedBPM 0-1 (bpm - 60) / 120 Tempo matching
12 keyMode 0/0.5/1 minor/unknown/major Tonality

Valence & Arousal Derivation

Since Essentia doesn't have direct valence/arousal models, we derive them from mood predictions:

Valence (Emotional Positivity)

valence = moodHappy * 0.5 + moodParty * 0.3 + (1 - moodSad) * 0.2

Rationale:

  • Happy mood is the strongest positive indicator (50% weight)
  • Party/upbeat suggests positive energy (30% weight)
  • Low sadness contributes to positivity (20% weight)

Arousal (Energy Level)

arousal = moodAggressive * 0.35 + moodParty * 0.25 + moodElectronic * 0.2 
        + (1 - moodRelaxed) * 0.1 + (1 - moodAcoustic) * 0.1

Rationale:

  • Aggressive music is high-energy (35% weight)
  • Party music has high arousal (25% weight)
  • Electronic music tends to be energetic (20% weight)
  • Low relaxation indicates higher energy (10% weight)
  • Non-acoustic sound suggests higher energy (10% weight)

Known Limitations & Edge Cases

1. Out-of-Distribution Audio

MusiCNN was trained on the Million Song Dataset (mostly pop/rock). For genres outside this distribution (classical, ambient, piano), the model sometimes outputs high values for ALL mood dimensions.

Detection & Normalization:

core_moods = ['moodHappy', 'moodSad', 'moodRelaxed', 'moodAggressive']
core_values = [raw_moods[m][0] for m in core_moods if m in raw_moods]

if len(core_values) >= 4:
    min_mood = min(core_values)
    max_mood = max(core_values)
    
    # If all core moods are > 0.7 AND the range is small,
    # the predictions are likely unreliable (out-of-distribution audio)
    if min_mood > 0.7 and (max_mood - min_mood) < 0.3:
        # Normalize: scale so max becomes 0.8 and min becomes 0.2
        for mood_key in core_moods:
            old_val = raw_moods[mood_key][0]
            normalized = 0.2 + (old_val - min_mood) / (max_mood - min_mood) * 0.6
            raw_moods[mood_key] = normalized

2. Standard Mode Fallback

When ML models aren't available, heuristic estimates are used:

Feature Heuristic Formula
Valence key_valence * 0.4 + bpm_valence * 0.25 + brightness * 0.2 + energy * 0.15
Arousal bpm_arousal * 0.35 + energy * 0.35 + brightness * 0.15 + compression * 0.15
Instrumentalness spectral_flatness * 0.6 + zcr_instrumental * 0.4
Acousticness dynamic_range / 12

3. Feature Vector Missing Values

Missing values default to 0.5 (neutral) to prevent bias:

track.moodHappy ?? 0.5

Open Questions for Review

  1. Feature Weighting: Currently all 13 features have equal weight in cosine similarity. Should mood features (indices 0-6) have higher weight than audio features?

  2. Threshold Selection: We use 40% similarity threshold for Enhanced mode. Is this too permissive? Too restrictive?

  3. Valence/Arousal Derivation: Our formulas for deriving valence/arousal from mood predictions are hand-tuned. Are the weights reasonable?

  4. BPM Normalization: We normalize BPM to 60-180 range. Should we use octave-aware BPM (treating 60 and 120 as similar)?

  5. Cross-Genre Matching: The algorithm matches based on audio similarity regardless of genre. Should genre matching have more weight?

  6. Cold Start: Tracks with missing analysis fall back to 0.5 for all features. Should they be excluded from matching?


Dependencies

Python (Audio Analyzer)

essentia==2.1b6.dev1110
essentia-tensorflow==2.1b6.dev1110
numpy>=1.21.0,<2.0.0
tensorflow==2.15.0
redis>=4.5.0
psycopg2-binary>=2.9.0

MusiCNN Models (Essentia Model Zoo)

  • msd-musicnn-1.pb - Base embedding model (~3MB)
  • mood_happy-msd-musicnn-1.pb - Happy classifier
  • mood_sad-msd-musicnn-1.pb - Sad classifier
  • mood_relaxed-msd-musicnn-1.pb - Relaxed classifier
  • mood_aggressive-msd-musicnn-1.pb - Aggressive classifier
  • mood_party-msd-musicnn-1.pb - Party classifier
  • mood_acoustic-msd-musicnn-1.pb - Acoustic classifier
  • mood_electronic-msd-musicnn-1.pb - Electronic classifier
  • danceability-msd-musicnn-1.pb - Danceability classifier
  • voice_instrumental-msd-musicnn-1.pb - Voice/instrumental classifier

References


File Locations

Component Path
Audio Analyzer services/audio-analyzer/analyzer.py
Vibe Matching backend/src/routes/library.ts (lines 3293-3580)
Database Schema backend/prisma/schema.prisma
Standard Mode Docs docs/implementation-summaries/audio-analysis-standard-mode/README.md
Enhanced Mode Docs docs/implementation-summaries/audio-analysis-standard-mode/ENHANCED_MODE.md
Algorithm Overview docs/implementation-summaries/vibe-matching-overhaul/README.md