# Vibe Matching Implementation Plan

## Executive Summary

The current vibe matching system uses Essentia for audio analysis but only extracts **basic features**. Critical mood/emotion features are either placeholder values or poorly estimated. This document outlines a plan to achieve Spotify-quality vibe matching while staying mindful of performance on user hardware.

## Strategy Update (Latest)

**Default:** Enhanced mode (ML-powered, accurate)
**Fallback:** Standard mode (lightweight, for troubleshooting or power saving)

**Approach:**

1. ✅ Pre-package all Essentia TensorFlow models in Docker image (~200MB)
2. 🔄 Fix Enhanced mode FIRST - make it actually use the ML models
3. ⏳ THEN create Standard mode as a lightweight fallback
4. Users can toggle to Standard mode to save CPU if needed

---

## Current State Analysis

### What Essentia IS Currently Extracting (Working)

| Feature | Status | Quality |
|---------|--------|---------|
| **BPM** | ✅ Working | Good - Uses `RhythmExtractor2013` |
| **Key** | ✅ Working | Good - Uses `KeyExtractor` |
| **KeyScale** | ✅ Working | Good - major/minor detection |
| **Energy** | ✅ Working | Moderate - Raw energy normalized |
| **Loudness** | ✅ Working | Good - dB measurement |
| **Dynamic Range** | ✅ Working | Good |
| **Danceability** | ✅ Working | Good - Uses `Danceability` algorithm |
| **Beats Count** | ✅ Working | Good |

### What's Broken or Placeholder

| Feature | Status | Problem |
|---------|--------|---------|
| **Valence** | ⚠️ Fake | Calculated as `(major/minor * 0.4) + (energy * 0.6)` - NOT actual emotional valence |
| **Arousal** | ⚠️ Fake | Calculated as `(BPM * 0.5) + (energy * 0.5)` - NOT actual arousal |
| **Instrumentalness** | ❌ Placeholder | Hardcoded to `0.5` |
| **Acousticness** | ⚠️ Estimate | Rough estimate from dynamic range |
| **Speechiness** | ❌ Placeholder | Hardcoded to `0.1` |
| **Mood Tags** | ⚠️ Derived | Generated from fake valence/arousal, not ML |
| **Genre Tags** | ❌ Empty | TensorFlow models not loaded |

### The Core Issue

```python
# Current valence calculation (analyzer.py lines 226-231)
key_valence = 0.6 if scale == 'major' else 0.4
energy_valence = result['energy']
result['valence'] = round((key_valence * 0.4 + energy_valence * 0.6), 3)
```

**"Fake Happy" by Paramore** (emotionally complex, about masking sadness):

- Major key → 0.6
- High energy → ~0.7
- Calculated valence: `(0.6 * 0.4) + (0.7 * 0.6) = 0.66` (appears "happy")

**"Summer Girl" by Jamiroquai** (genuinely upbeat funk):

- Major key → 0.6
- High energy → ~0.7
- Calculated valence: `(0.6 * 0.4) + (0.7 * 0.6) = 0.66` (appears "happy")

**Result: 97% match despite being completely different vibes!** Because the formula only sees key and energy, any two tracks that share those two inputs receive the same valence, regardless of their emotional content.

---

## How Spotify Does It

Spotify's audio analysis uses a combination of:

### 1. Low-Level Audio Features (Similar to what we have)

- Tempo/BPM
- Key/Mode
- Loudness
- Time signature

### 2. Mid-Level Features (We're missing these)

- **Spectral Centroid** - "brightness" of the sound
- **Spectral Rolloff** - frequency distribution
- **Zero Crossing Rate** - percussiveness
- **MFCCs** - Mel-frequency cepstral coefficients (timbral texture)
- **Chroma Features** - harmonic content

### 3. High-Level Features (We're faking these)

- **Valence** - Musical positiveness (0-1)
- **Arousal/Energy** - Intensity and activity
- **Instrumentalness** - Vocal presence prediction
- **Acousticness** - Acoustic vs electronic
- **Speechiness** - Presence of spoken words
- **Liveness** - Audience presence detection

### 4. Deep Learning Models

Spotify trains neural networks on millions of labeled tracks to predict:

- Mood categories
- Genre classification
- User preference patterns

---

## Two-Tier System

### Default: Enhanced Vibe Matching (ML-Powered)

**Status:** DEFAULT - Pre-packaged in Docker, just works
**Target:** High accuracy, ~5-10 seconds per track

**Features (from Essentia TensorFlow Models):**

1. **Mood Predictions (real ML, not estimated):**
   - `mood_happy-discogs-effnet-1.pb` - Happiness/positivity 0-1
   - `mood_sad-discogs-effnet-1.pb` - Sadness 0-1
   - `mood_relaxed-discogs-effnet-1.pb` - Relaxation/calmness 0-1
   - `mood_aggressive-discogs-effnet-1.pb` - Aggression/intensity 0-1

2. **Audio Characteristics:**
   - `danceability-discogs-effnet-1.pb` - ML-based danceability
   - `voice_instrumental-discogs-effnet-1.pb` - Vocal detection (instrumentalness)

3. **Embeddings for Similarity:**
   - `discogs-effnet-bs64-1.pb` - Audio embeddings (neural "fingerprint")
   - Can be used for direct similarity comparison

4. **Spectral Features:**
   - Spectral Centroid (brightness)
   - MFCCs (timbral texture - 13 coefficients)

**Models Pre-packaged:** ~200MB in Docker image (no user download)
**RAM Requirement:** ~500MB during analysis
**CPU Requirement:** Any modern CPU (2015+)

### Fallback: Standard Vibe Matching (Lightweight)

**Status:** FALLBACK - For troubleshooting or power saving
**Target:** Fast, <2 seconds per track, low CPU

**Features Used:**

- BPM (Essentia RhythmExtractor)
- Energy (Essentia Energy)
- Danceability (Essentia Danceability - non-ML version)
- Key/Scale (Essentia KeyExtractor)
- Spectral Centroid (cheap to compute)
- Last.fm mood tags
- Genre matching from tags

**When to use Standard mode:**

- Low-power devices (Raspberry Pi, older NAS)
- Troubleshooting if Enhanced mode has issues
- User preference to save CPU cycles

---

## Implementation Plan

### Phase 1: Pre-Package Models in Docker (Day 1)

#### 1.1 Update Dockerfile to Include Models

```dockerfile
# Download Essentia ML models during build (~200MB)
# ca-certificates is needed for HTTPS because --no-install-recommends skips it.
# curl -f makes the build fail on an HTTP error instead of saving an error page as a model.
RUN apt-get update && apt-get install -y --no-install-recommends curl ca-certificates && \
    mkdir -p /app/models && \
    # Base embedding model (required for all predictions)
    curl -fL -o /app/models/discogs-effnet-bs64-1.pb \
        "https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb" && \
    # Mood models
    curl -fL -o /app/models/mood_happy-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_happy/mood_happy-discogs-effnet-1.pb" && \
    curl -fL -o /app/models/mood_sad-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_sad/mood_sad-discogs-effnet-1.pb" && \
    curl -fL -o /app/models/mood_relaxed-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_relaxed/mood_relaxed-discogs-effnet-1.pb" && \
    curl -fL -o /app/models/mood_aggressive-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_aggressive/mood_aggressive-discogs-effnet-1.pb" && \
    # Danceability and voice/instrumental
    curl -fL -o /app/models/danceability-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/danceability/danceability-discogs-effnet-1.pb" && \
    curl -fL -o /app/models/voice_instrumental-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/voice_instrumental/voice_instrumental-discogs-effnet-1.pb" && \
    # Arousal/Valence models (saved locally without the "mood_" prefix)
    curl -fL -o /app/models/arousal-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_arousal/mood_arousal-discogs-effnet-1.pb" && \
    curl -fL -o /app/models/valence-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_valence/mood_valence-discogs-effnet-1.pb" && \
    apt-get purge -y curl && rm -rf /var/lib/apt/lists/*
```

### Phase 2: Implement Enhanced Analysis (Days 2-4)

#### 2.1 Rewrite analyzer.py with ML Models

```python
import logging
import os
from typing import Any, Dict

import numpy as np

logger = logging.getLogger(__name__)

# ESSENTIA_AVAILABLE, _init_essentia, _extract_basic_features,
# _estimate_mood_features and load_audio are assumed to already exist in analyzer.py.


class AudioAnalyzer:
    """Enhanced audio analysis using Essentia TensorFlow models"""

    def __init__(self):
        self.models_loaded = False
        self.embedding_model = None
        self.mood_models = {}

        if ESSENTIA_AVAILABLE:
            self._init_essentia()
            self._load_ml_models()

    def _load_ml_models(self):
        """Load TensorFlow models for enhanced analysis"""
        try:
            from essentia.standard import (
                TensorflowPredictEffnetDiscogs,
                TensorflowPredict2D
            )

            # Load embedding extractor (base for all predictions)
            embedding_path = '/app/models/discogs-effnet-bs64-1.pb'
            if os.path.exists(embedding_path):
                self.embedding_model = TensorflowPredictEffnetDiscogs(
                    graphFilename=embedding_path,
                    output="PartitionedCall:1"
                )
                logger.info("Loaded embedding model")

            # Load mood prediction models
            mood_models = {
                'happy': '/app/models/mood_happy-discogs-effnet-1.pb',
                'sad': '/app/models/mood_sad-discogs-effnet-1.pb',
                'relaxed': '/app/models/mood_relaxed-discogs-effnet-1.pb',
                'aggressive': '/app/models/mood_aggressive-discogs-effnet-1.pb',
                'danceability': '/app/models/danceability-discogs-effnet-1.pb',
                'voice_instrumental': '/app/models/voice_instrumental-discogs-effnet-1.pb',
                'arousal': '/app/models/arousal-discogs-effnet-1.pb',
                'valence': '/app/models/valence-discogs-effnet-1.pb',
            }

            for name, path in mood_models.items():
                if os.path.exists(path):
                    self.mood_models[name] = TensorflowPredict2D(
                        graphFilename=path,
                        output="model/Softmax"
                    )
                    logger.info(f"Loaded {name} model")

            # Enhanced mode needs the embedding model plus at least one prediction head;
            # without the embedding model, _extract_ml_features would call None.
            self.models_loaded = (
                self.embedding_model is not None and len(self.mood_models) > 0
            )
            logger.info(f"ML models loaded: {self.models_loaded} ({len(self.mood_models)} models)")

        except Exception as e:
            logger.warning(f"Could not load ML models: {e}")
            self.models_loaded = False

    def analyze(self, file_path: str) -> Dict[str, Any]:
        """Full analysis with ML models if available"""
        result = self._extract_basic_features(file_path)

        if self.models_loaded:
            ml_features = self._extract_ml_features(file_path)
            result.update(ml_features)
            result['analysisMode'] = 'enhanced'
        else:
            # Fallback to estimated values
            result.update(self._estimate_mood_features(result))
            result['analysisMode'] = 'standard'

        return result

    def _extract_ml_features(self, file_path: str) -> Dict[str, Any]:
        """Extract features using TensorFlow models"""
        result = {}

        # Load audio at 16kHz for ML models
        audio = self.load_audio(file_path, sample_rate=16000)
        if audio is None:
            return result

        # Get embeddings
        embeddings = self.embedding_model(audio)

        # NOTE: column 1 is treated as the positive class for every head below.
        # The class order may differ between models, so confirm it against each
        # model's metadata before relying on these values.

        # Mood predictions
        if 'happy' in self.mood_models:
            preds = self.mood_models['happy'](embeddings)
            result['moodHappy'] = float(np.mean(preds[:, 1]))  # Probability of "happy"

        if 'sad' in self.mood_models:
            preds = self.mood_models['sad'](embeddings)
            result['moodSad'] = float(np.mean(preds[:, 1]))

        if 'relaxed' in self.mood_models:
            preds = self.mood_models['relaxed'](embeddings)
            result['moodRelaxed'] = float(np.mean(preds[:, 1]))

        if 'aggressive' in self.mood_models:
            preds = self.mood_models['aggressive'](embeddings)
            result['moodAggressive'] = float(np.mean(preds[:, 1]))

        # Real valence and arousal from dedicated models
        if 'valence' in self.mood_models:
            preds = self.mood_models['valence'](embeddings)
            result['valence'] = float(np.mean(preds[:, 1]))

        if 'arousal' in self.mood_models:
            preds = self.mood_models['arousal'](embeddings)
            result['arousal'] = float(np.mean(preds[:, 1]))

        # Instrumentalness from voice/instrumental model
        if 'voice_instrumental' in self.mood_models:
            preds = self.mood_models['voice_instrumental'](embeddings)
            result['instrumentalness'] = float(np.mean(preds[:, 1]))  # 1 = instrumental

        # ML-based danceability
        if 'danceability' in self.mood_models:
            preds = self.mood_models['danceability'](embeddings)
            result['danceabilityMl'] = float(np.mean(preds[:, 1]))

        return result
```

### Phase 3: Update Database Schema (Day 3)

#### 3.1 Add New Feature Columns

```prisma
model Track {
  // ... existing fields ...

  // ML-based mood predictions (Enhanced mode)
  moodHappy       Float?   // ML prediction 0-1
  moodSad         Float?   // ML prediction 0-1
  moodRelaxed     Float?   // ML prediction 0-1
  moodAggressive  Float?   // ML prediction 0-1
  danceabilityMl  Float?   // ML-based danceability

  // Analysis metadata
  analysisMode    String?  // 'standard' or 'enhanced'
}
```

### Phase 4: Update Vibe Matching Algorithm (Day 4)

#### 4.1 Use Real Mood Predictions in Matching

Each feature contributes its similarity multiplied by its weight. Features that are missing on either track are skipped, and the final score is divided by the sum of the weights actually used, so it stays in the 0-1 range even when some features are unavailable.

```typescript
// In library.ts - Enhanced vibe matching
const scored = analyzedTracks.map(t => {
  let score = 0;
  let factors = 0; // Sum of the weights that were actually applied

  // === MOOD MATCHING (50% total - the heart of vibe) ===

  // Happy mood (15%)
  if (sourceTrack.moodHappy !== null && t.moodHappy !== null) {
    score += (1 - Math.abs(sourceTrack.moodHappy - t.moodHappy)) * 0.15;
    factors += 0.15;
  }

  // Sad mood (10%)
  if (sourceTrack.moodSad !== null && t.moodSad !== null) {
    score += (1 - Math.abs(sourceTrack.moodSad - t.moodSad)) * 0.10;
    factors += 0.10;
  }

  // Relaxed mood (10%)
  if (sourceTrack.moodRelaxed !== null && t.moodRelaxed !== null) {
    score += (1 - Math.abs(sourceTrack.moodRelaxed - t.moodRelaxed)) * 0.10;
    factors += 0.10;
  }

  // Aggressive mood (10%)
  if (sourceTrack.moodAggressive !== null && t.moodAggressive !== null) {
    score += (1 - Math.abs(sourceTrack.moodAggressive - t.moodAggressive)) * 0.10;
    factors += 0.10;
  }

  // Valence - overall positivity (5%)
  if (sourceTrack.valence !== null && t.valence !== null) {
    score += (1 - Math.abs(sourceTrack.valence - t.valence)) * 0.05;
    factors += 0.05;
  }

  // === AUDIO CHARACTERISTICS (35% total) ===

  // BPM (15%) - within ±15 BPM is good; similarity reaches 0 at a 30 BPM difference
  if (sourceTrack.bpm && t.bpm) {
    const bpmDiff = Math.abs(sourceTrack.bpm - t.bpm);
    score += Math.max(0, 1 - bpmDiff / 30) * 0.15;
    factors += 0.15;
  }

  // Energy (10%)
  if (sourceTrack.energy !== null && t.energy !== null) {
    score += (1 - Math.abs(sourceTrack.energy - t.energy)) * 0.10;
    factors += 0.10;
  }

  // Danceability - prefer ML version (10%)
  const srcDance = sourceTrack.danceabilityMl ?? sourceTrack.danceability;
  const tDance = t.danceabilityMl ?? t.danceability;
  if (srcDance !== null && tDance !== null) {
    score += (1 - Math.abs(srcDance - tDance)) * 0.10;
    factors += 0.10;
  }

  // === GENRE/TAGS (15% total) ===

  // Genre/tag overlap (10%)
  const sourceGenres = [...(sourceTrack.lastfmTags || []), ...(sourceTrack.essentiaGenres || [])];
  const trackGenres = [...(t.lastfmTags || []), ...(t.essentiaGenres || [])];
  if (sourceGenres.length > 0 && trackGenres.length > 0) {
    const overlap = sourceGenres.filter(g => trackGenres.includes(g)).length;
    const maxOverlap = Math.max(sourceGenres.length, trackGenres.length);
    score += (overlap / maxOverlap) * 0.10;
    factors += 0.10;
  }

  // Key compatibility (5%)
  if (sourceTrack.keyScale && t.keyScale) {
    score += (sourceTrack.keyScale === t.keyScale ? 1 : 0.5) * 0.05;
    factors += 0.05;
  }

  // Normalize by the weights actually used so missing features don't lower the score
  const finalScore = factors > 0 ? score / factors : 0;
  return { id: t.id, score: finalScore };
});
```

### Phase 5: Create Standard Mode Fallback (Day 5)

After Enhanced mode is working, implement Standard mode:

- Same algorithm structure but skip ML features
- Use estimated valence (improved heuristics)
- Lower weights on mood matching since it's estimated
- Higher weights on BPM, energy, genre tags

### Phase 6: Settings & UI (Day 6)

#### 6.1 Add Settings Toggle

```typescript
// System settings - Enhanced is DEFAULT
{
  audioAnalysis: {
    vibeMatchingMode: 'enhanced' | 'standard',  // Default: 'enhanced'
    reanalyzeOnModeChange: boolean,             // Default: false
  }
}
```

#### 6.2 Settings UI

```
Audio Analysis
├── Vibe Matching Mode
│   ├── ● Enhanced (Recommended - Default)
│   │   └── Uses ML models for accurate mood detection
│   └── ○ Standard (Power Saver)
│       └── Faster, uses basic audio features only
│
├── Analysis Status
│   └── "1,234 / 1,500 tracks analyzed (Enhanced mode)"
│
└── [Re-analyze Library] button
    └── "Re-analyze all tracks with current settings"
```

### Phase 7: Testing & Validation (Day 7)

#### 7.1 Test Cases

| Source Track | Bad Match (Current) | Expected Good Match |
|--------------|---------------------|---------------------|
| "Fake Happy" (Paramore) | "Summer Girl" (Jamiroquai) 97% | Other emo/pop-punk <60% |
| "Creep" (Radiohead) | Fast dance track | Other melancholic rock |
| "Uptown Funk" | Slow ballad | Other high-energy funk/pop |

#### 7.2 Performance Testing

- Analyze 100 tracks, measure time
- Memory usage during analysis
- Queue handling under load

---

## Database Schema Updates

```prisma
model Track {
  // ... existing fields ...

  // ML-based mood predictions (Enhanced mode)
  moodHappy       Float?   // ML prediction 0-1
  moodSad         Float?   // ML prediction 0-1
  moodRelaxed     Float?   // ML prediction 0-1
  moodAggressive  Float?   // ML prediction 0-1
  danceabilityMl  Float?   // ML-based danceability

  // Analysis metadata
  analysisMode    String?  // 'standard' or 'enhanced'
}
```

---

## Performance Benchmarks (Estimated)

| Operation | Standard Mode | Enhanced Mode |
|-----------|---------------|---------------|
| Analysis per track | 1-2 sec | 5-10 sec |
| RAM usage | ~100MB | ~500MB |
| Models in Docker | N/A | ~200MB (pre-packaged) |
| Vibe match query | <100ms | <100ms |
| Full library (1000 tracks) | ~30 min | ~2-3 hours |

---

## Files to Modify

| File | Changes |
|------|---------|
| `services/audio-analyzer/Dockerfile` | Add model downloads during build |
| `services/audio-analyzer/analyzer.py` | Implement ML model loading and prediction |
| `backend/prisma/schema.prisma` | Add mood prediction columns |
| `backend/src/routes/library.ts` | Update vibe matching algorithm weights |
| `frontend/features/settings/` | Add analysis mode toggle (default: enhanced) |
| `frontend/components/player/VibeGraph.tsx` | Display mood predictions |

---

## Success Metrics

After implementation, "Fake Happy" and "Summer Girl" should:

- Match at **<50%** (different emotional content, different genre)

Better matches for "Fake Happy" would be:

- Other Paramore songs (same artist = genre/production match)
- Emo/pop-punk with similar emotional complexity
- Songs with high energy but mixed emotional signals

---

## Implementation Order (Enhanced First)

### Week 1: Get Enhanced Mode Working

1. [x] Create implementation plan (this document)
2. [x] Update Dockerfile to pre-package ML models (~200MB)
3. [x] Rewrite analyzer.py with TensorFlow model loading
4. [x] Add new database columns for mood predictions (moodHappy, moodSad, etc.)
5. [x] Update vibe matching algorithm with ML mood weights
6. [x] Update programmatic playlists to use ML mood predictions
7. [ ] Run Prisma migration to apply schema changes
8. [ ] Rebuild audio-analyzer Docker container
9. [ ] Test ML analysis on sample tracks

### Week 2: Polish & Fallback

10. [ ] Test accuracy with diverse track pairs
11. [ ] Add settings UI (Enhanced = default)
12. [ ] Implement Standard mode as explicit fallback option
13. [ ] Update VibeGraph to show mood predictions
14. [ ] Documentation and testing

---

## Quick Reference: Models to Include

| Model | File | Purpose | Size |
|-------|------|---------|------|
| Embeddings | `discogs-effnet-bs64-1.pb` | Base model for all predictions | ~85MB |
| Happy | `mood_happy-discogs-effnet-1.pb` | Happiness detection | ~15MB |
| Sad | `mood_sad-discogs-effnet-1.pb` | Sadness detection | ~15MB |
| Relaxed | `mood_relaxed-discogs-effnet-1.pb` | Relaxation detection | ~15MB |
| Aggressive | `mood_aggressive-discogs-effnet-1.pb` | Aggression detection | ~15MB |
| Arousal | `mood_arousal-discogs-effnet-1.pb` | Energy/calm scale | ~15MB |
| Valence | `mood_valence-discogs-effnet-1.pb` | Positive/negative | ~15MB |
| Danceability | `danceability-discogs-effnet-1.pb` | ML danceability | ~15MB |
| Voice/Instrumental | `voice_instrumental-discogs-effnet-1.pb` | Vocal detection | ~15MB |

**Total:** ~200MB (one-time addition to Docker image)
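
Since Enhanced mode silently falls back when model files are absent, a quick check of the image contents can be useful after rebuilding the container. The sketch below is one possible way to do that, not part of the existing codebase: `EXPECTED_MODELS` and `check_models` are hypothetical names, and the file names are assumed to be the local ones that the Dockerfile step writes and `analyzer.py` loads (so the arousal and valence heads appear without the upstream `mood_` prefix).

```python
from pathlib import Path
from typing import Any, Dict, List

# Hypothetical manifest: local file names as saved under /app/models by the
# Dockerfile step in this plan. Arousal and valence are stored without the
# upstream "mood_" prefix, matching the paths analyzer.py loads.
EXPECTED_MODELS: List[str] = [
    "discogs-effnet-bs64-1.pb",
    "mood_happy-discogs-effnet-1.pb",
    "mood_sad-discogs-effnet-1.pb",
    "mood_relaxed-discogs-effnet-1.pb",
    "mood_aggressive-discogs-effnet-1.pb",
    "arousal-discogs-effnet-1.pb",
    "valence-discogs-effnet-1.pb",
    "danceability-discogs-effnet-1.pb",
    "voice_instrumental-discogs-effnet-1.pb",
]


def check_models(model_dir: str = "/app/models") -> Dict[str, Any]:
    """Report which expected model files are missing and the total size on disk."""
    root = Path(model_dir)
    present = [name for name in EXPECTED_MODELS if (root / name).is_file()]
    missing = [name for name in EXPECTED_MODELS if name not in present]
    total_bytes = sum((root / name).stat().st_size for name in present)
    return {
        "missing": missing,
        "total_mb": round(total_bytes / (1024 * 1024), 1),
    }


if __name__ == "__main__":
    report = check_models()
    print(f"Missing models: {report['missing'] or 'none'}")
    print(f"Total size: {report['total_mb']} MB")
```

Running this inside the rebuilt audio-analyzer container should list no missing files and a total in the neighborhood of the ~200MB estimate above; a much smaller total would suggest that a download saved something other than a model.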