lidify/docs/implementation-summaries/vibe-matching/IMPLEMENTATION_PLAN.md
2025-12-25 18:58:06 -06:00

Vibe Matching Implementation Plan

Executive Summary

The current vibe matching system uses Essentia for audio analysis but extracts only basic features. Critical mood/emotion features are either placeholders or rough estimates. This document outlines a comprehensive plan to achieve Spotify-quality vibe matching while remaining mindful of performance on user hardware.

Strategy Update (Latest)

Default: Enhanced mode (ML-powered, accurate)
Fallback: Standard mode (lightweight, for troubleshooting or power saving)

Approach:

  1. Pre-package all Essentia TensorFlow models in Docker image (~200MB)
  2. 🔄 Fix Enhanced mode FIRST - make it actually use the ML models
  3. THEN create Standard mode as a lightweight fallback
  4. Users can toggle to Standard mode to save CPU if needed

Current State Analysis

What Essentia IS Currently Extracting (Working)

| Feature | Status | Quality |
|---|---|---|
| BPM | Working | Good - uses RhythmExtractor2013 |
| Key | Working | Good - uses KeyExtractor |
| KeyScale | Working | Good - major/minor detection |
| Energy | Working | Moderate - raw energy, normalized |
| Loudness | Working | Good - dB measurement |
| Dynamic Range | Working | Good |
| Danceability | Working | Good - uses Danceability algorithm |
| Beats Count | Working | Good |

What's Broken or Placeholder

| Feature | Status | Problem |
|---|---|---|
| Valence | ⚠️ Fake | Calculated as (major/minor * 0.4) + (energy * 0.6) - NOT actual emotional valence |
| Arousal | ⚠️ Fake | Calculated as (BPM * 0.5) + (energy * 0.5) - NOT actual arousal |
| Instrumentalness | Placeholder | Hardcoded to 0.5 |
| Acousticness | ⚠️ Estimate | Rough estimate from dynamic range |
| Speechiness | Placeholder | Hardcoded to 0.1 |
| Mood Tags | ⚠️ Derived | Generated from fake valence/arousal, not ML |
| Genre Tags | Empty | TensorFlow models not loaded |

The Core Issue

# Current valence calculation (analyzer.py lines 226-231)
key_valence = 0.6 if scale == 'major' else 0.4
energy_valence = result['energy']
result['valence'] = round((key_valence * 0.4 + energy_valence * 0.6), 3)

"Fake Happy" by Paramore (emotionally complex, about masking sadness):

  • Major key → 0.6
  • High energy → ~0.7
  • Calculated valence: (0.6 * 0.4) + (0.7 * 0.6) = 0.66 (appears "happy")

"Summer Girl" by Jamiroquai (genuinely upbeat funk):

  • Major key → 0.6
  • High energy → ~0.7
  • Calculated valence: (0.6 * 0.4) + (0.7 * 0.6) = 0.66 (appears "happy")

Result: 97% match despite being completely different vibes!
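The collapse is easy to reproduce. This minimal sketch re-implements the current formula from analyzer.py and shows that any two major-key, high-energy tracks get the same score, regardless of actual mood:

```python
# Reproducing the failure above: two very different songs collapse to the
# same estimated valence because the formula only sees key and energy.
def fake_valence(scale: str, energy: float) -> float:
    key_valence = 0.6 if scale == 'major' else 0.4
    return round(key_valence * 0.4 + energy * 0.6, 3)

print(fake_valence('major', 0.7))  # 0.66 -- "Fake Happy"
print(fake_valence('major', 0.7))  # 0.66 -- "Summer Girl", identical
```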


How Spotify Does It

Spotify's audio analysis uses a combination of:

1. Low-Level Audio Features (Similar to what we have)

  • Tempo/BPM
  • Key/Mode
  • Loudness
  • Time signature

2. Mid-Level Features (We're missing these)

  • Spectral Centroid - "brightness" of the sound
  • Spectral Rolloff - frequency distribution
  • Zero Crossing Rate - percussiveness
  • MFCCs - Mel-frequency cepstral coefficients (timbral texture)
  • Chroma Features - harmonic content
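None of these mid-level features require ML models. For intuition, here is an illustrative NumPy sketch of two of them (not the Essentia implementation - just the underlying math on a synthetic tone):

```python
import numpy as np

def spectral_centroid(frame: np.ndarray, sample_rate: int) -> float:
    """Magnitude-weighted mean frequency of one frame ("brightness")."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    if spectrum.sum() == 0:
        return 0.0
    return float((freqs * spectrum).sum() / spectrum.sum())

def zero_crossing_rate(frame: np.ndarray) -> float:
    """Fraction of adjacent samples that change sign (percussiveness proxy)."""
    return float(np.mean(np.abs(np.diff(np.sign(frame))) > 0))

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)     # pure 440 Hz sine, 1 second
print(round(spectral_centroid(tone, sr)))  # ~440: centroid tracks the pitch
```

A bright, hi-hat-heavy track pushes the centroid up; a bass-led ballad pulls it down - which is exactly why it helps separate "vibes" that share BPM and key.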

3. High-Level Features (We're faking these)

  • Valence - Musical positiveness (0-1)
  • Arousal/Energy - Intensity and activity
  • Instrumentalness - Vocal presence prediction
  • Acousticness - Acoustic vs electronic
  • Speechiness - Presence of spoken words
  • Liveness - Audience presence detection

4. Deep Learning Models

Spotify trains neural networks on millions of labeled tracks to predict:

  • Mood categories
  • Genre classification
  • User preference patterns

Two-Tier System

Default: Enhanced Vibe Matching (ML-Powered)

Status: DEFAULT - Pre-packaged in Docker, just works
Target: High accuracy, ~5-10 seconds per track

Features (from Essentia TensorFlow Models):

  1. Mood Predictions (real ML, not estimated):

    • mood_happy-discogs-effnet-1.pb - Happiness/positivity 0-1
    • mood_sad-discogs-effnet-1.pb - Sadness 0-1
    • mood_relaxed-discogs-effnet-1.pb - Relaxation/calmness 0-1
    • mood_aggressive-discogs-effnet-1.pb - Aggression/intensity 0-1
  2. Audio Characteristics:

    • danceability-discogs-effnet-1.pb - ML-based danceability
    • voice_instrumental-discogs-effnet-1.pb - Vocal detection (instrumentalness)
  3. Embeddings for Similarity:

    • discogs-effnet-bs64-1.pb - Audio embeddings (neural "fingerprint")
    • Can be used for direct similarity comparison
  4. Spectral Features:

    • Spectral Centroid (brightness)
    • MFCCs (timbral texture - 13 coefficients)

Models Pre-packaged: ~200MB in Docker image (no user download)
RAM Requirement: ~500MB during analysis
CPU Requirement: Any modern CPU (2015+)
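The embedding-based similarity mentioned in item 3 reduces to simple vector math: average the per-patch embeddings into one track-level vector, then compare tracks by cosine similarity. A sketch with toy vectors standing in for the real high-dimensional discogs-effnet embeddings:

```python
import numpy as np

def track_embedding(patch_embeddings: np.ndarray) -> np.ndarray:
    """Average per-patch embeddings (n_patches x dim) into one track vector."""
    return patch_embeddings.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 2-dim stand-ins for real discogs-effnet embeddings:
a = track_embedding(np.array([[1.0, 0.0], [0.8, 0.2]]))
b = track_embedding(np.array([[0.9, 0.1]]))
print(cosine_similarity(a, b) > 0.99)  # near-identical track-level vectors
```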

Fallback: Standard Vibe Matching (Lightweight)

Status: FALLBACK - For troubleshooting or power saving
Target: Fast, <2 seconds per track, low CPU

Features Used:

  • BPM (Essentia RhythmExtractor)
  • Energy (Essentia Energy)
  • Danceability (Essentia Danceability - non-ML version)
  • Key/Scale (Essentia KeyExtractor)
  • Spectral Centroid (cheap to compute)
  • Last.fm mood tags
  • Genre matching from tags

When to use Standard mode:

  • Low-power devices (Raspberry Pi, older NAS)
  • Troubleshooting if Enhanced mode has issues
  • User preference to save CPU cycles

Implementation Plan

Phase 1: Pre-Package Models in Docker (Day 1)

1.1 Update Dockerfile to Include Models

# Download Essentia ML models during build (~200MB)
RUN apt-get update && apt-get install -y --no-install-recommends curl && \
    mkdir -p /app/models && \
    # Base embedding model (required for all predictions)
    curl -L -o /app/models/discogs-effnet-bs64-1.pb \
        "https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb" && \
    # Mood models
    curl -L -o /app/models/mood_happy-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_happy/mood_happy-discogs-effnet-1.pb" && \
    curl -L -o /app/models/mood_sad-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_sad/mood_sad-discogs-effnet-1.pb" && \
    curl -L -o /app/models/mood_relaxed-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_relaxed/mood_relaxed-discogs-effnet-1.pb" && \
    curl -L -o /app/models/mood_aggressive-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_aggressive/mood_aggressive-discogs-effnet-1.pb" && \
    # Danceability and voice/instrumental
    curl -L -o /app/models/danceability-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/danceability/danceability-discogs-effnet-1.pb" && \
    curl -L -o /app/models/voice_instrumental-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/voice_instrumental/voice_instrumental-discogs-effnet-1.pb" && \
    # Arousal/Valence models
    curl -L -o /app/models/arousal-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_arousal/mood_arousal-discogs-effnet-1.pb" && \
    curl -L -o /app/models/valence-discogs-effnet-1.pb \
        "https://essentia.upf.edu/models/classification-heads/mood_valence/mood_valence-discogs-effnet-1.pb" && \
    apt-get purge -y curl && rm -rf /var/lib/apt/lists/*
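Since the models are baked into the image, a broken build (e.g. a curl that silently saved an error page) would otherwise surface only as a quiet fallback to estimated values. A hypothetical fail-fast startup check - not existing code - could catch that:

```python
import os
from typing import List

# Every model the Dockerfile above is expected to have downloaded.
EXPECTED_MODELS = [
    "discogs-effnet-bs64-1.pb",
    "mood_happy-discogs-effnet-1.pb",
    "mood_sad-discogs-effnet-1.pb",
    "mood_relaxed-discogs-effnet-1.pb",
    "mood_aggressive-discogs-effnet-1.pb",
    "danceability-discogs-effnet-1.pb",
    "voice_instrumental-discogs-effnet-1.pb",
    "arousal-discogs-effnet-1.pb",
    "valence-discogs-effnet-1.pb",
]

def missing_models(model_dir: str) -> List[str]:
    """Return the expected model files not present in model_dir."""
    return [m for m in EXPECTED_MODELS
            if not os.path.exists(os.path.join(model_dir, m))]
```

At container startup, a non-empty `missing_models('/app/models')` result could be logged loudly (or fail health checks) instead of degrading silently.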

Phase 2: Implement Enhanced Analysis (Days 2-4)

2.1 Rewrite analyzer.py with ML Models

import logging
import os

import numpy as np

logger = logging.getLogger(__name__)


class AudioAnalyzer:
    """Enhanced audio analysis using Essentia TensorFlow models"""
    
    def __init__(self):
        self.models_loaded = False
        self.embedding_model = None
        self.mood_models = {}
        
        if ESSENTIA_AVAILABLE:
            self._init_essentia()
            self._load_ml_models()
    
    def _load_ml_models(self):
        """Load TensorFlow models for enhanced analysis"""
        try:
            from essentia.standard import (
                TensorflowPredictEffnetDiscogs,
                TensorflowPredict2D
            )
            
            # Load embedding extractor (base for all predictions)
            embedding_path = '/app/models/discogs-effnet-bs64-1.pb'
            if os.path.exists(embedding_path):
                self.embedding_model = TensorflowPredictEffnetDiscogs(
                    graphFilename=embedding_path,
                    output="PartitionedCall:1"
                )
                logger.info("Loaded embedding model")
            
            # Load mood prediction models
            mood_models = {
                'happy': '/app/models/mood_happy-discogs-effnet-1.pb',
                'sad': '/app/models/mood_sad-discogs-effnet-1.pb',
                'relaxed': '/app/models/mood_relaxed-discogs-effnet-1.pb',
                'aggressive': '/app/models/mood_aggressive-discogs-effnet-1.pb',
                'danceability': '/app/models/danceability-discogs-effnet-1.pb',
                'voice_instrumental': '/app/models/voice_instrumental-discogs-effnet-1.pb',
                'arousal': '/app/models/arousal-discogs-effnet-1.pb',
                'valence': '/app/models/valence-discogs-effnet-1.pb',
            }
            
            for name, path in mood_models.items():
                if os.path.exists(path):
                    self.mood_models[name] = TensorflowPredict2D(
                        graphFilename=path,
                        output="model/Softmax"
                    )
                    logger.info(f"Loaded {name} model")
            
            self.models_loaded = len(self.mood_models) > 0
            logger.info(f"ML models loaded: {self.models_loaded} ({len(self.mood_models)} models)")
            
        except Exception as e:
            logger.warning(f"Could not load ML models: {e}")
            self.models_loaded = False
    
    def analyze(self, file_path: str) -> Dict[str, Any]:
        """Full analysis with ML models if available"""
        result = self._extract_basic_features(file_path)
        
        if self.models_loaded:
            ml_features = self._extract_ml_features(file_path)
            result.update(ml_features)
            result['analysisMode'] = 'enhanced'
        else:
            # Fallback to estimated values
            result.update(self._estimate_mood_features(result))
            result['analysisMode'] = 'standard'
        
        return result
    
    def _extract_ml_features(self, file_path: str) -> Dict[str, Any]:
        """Extract features using TensorFlow models"""
        result = {}
        
        # Load audio at 16kHz for ML models
        audio = self.load_audio(file_path, sample_rate=16000)
        if audio is None:
            return result
        
        # Get embeddings
        embeddings = self.embedding_model(audio)
        
        # Mood predictions. NOTE: class-index order ([:, 0] vs [:, 1]) varies
        # per model - verify against each model's metadata JSON before
        # trusting which column is the positive class.
        if 'happy' in self.mood_models:
            preds = self.mood_models['happy'](embeddings)
            result['moodHappy'] = float(np.mean(preds[:, 1]))  # Probability of "happy"
        
        if 'sad' in self.mood_models:
            preds = self.mood_models['sad'](embeddings)
            result['moodSad'] = float(np.mean(preds[:, 1]))
        
        if 'relaxed' in self.mood_models:
            preds = self.mood_models['relaxed'](embeddings)
            result['moodRelaxed'] = float(np.mean(preds[:, 1]))
        
        if 'aggressive' in self.mood_models:
            preds = self.mood_models['aggressive'](embeddings)
            result['moodAggressive'] = float(np.mean(preds[:, 1]))
        
        # Real valence and arousal from dedicated models
        if 'valence' in self.mood_models:
            preds = self.mood_models['valence'](embeddings)
            result['valence'] = float(np.mean(preds[:, 1]))
        
        if 'arousal' in self.mood_models:
            preds = self.mood_models['arousal'](embeddings)
            result['arousal'] = float(np.mean(preds[:, 1]))
        
        # Instrumentalness from voice/instrumental model
        if 'voice_instrumental' in self.mood_models:
            preds = self.mood_models['voice_instrumental'](embeddings)
            result['instrumentalness'] = float(np.mean(preds[:, 1]))  # 1 = instrumental
        
        # ML-based danceability
        if 'danceability' in self.mood_models:
            preds = self.mood_models['danceability'](embeddings)
            result['danceabilityMl'] = float(np.mean(preds[:, 1]))
        
        return result
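Each model returns one probability row per audio patch, and the code above collapses them with `np.mean`. A more outlier-robust alternative worth considering is the median - a quiet intro or outro patch then drags the track-level score less. A small sketch of both (the `aggregate_probability` helper is hypothetical, not part of the plan's code):

```python
import numpy as np

def aggregate_probability(preds: np.ndarray, class_index: int,
                          method: str = "mean") -> float:
    """Collapse per-patch class probabilities (n_patches x n_classes)
    into a single track-level score."""
    column = preds[:, class_index]
    return float(np.median(column) if method == "median" else np.mean(column))

# Three patches; the last one (e.g. a quiet outro) is an outlier.
preds = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
print(aggregate_probability(preds, 0))            # mean: 0.6
print(aggregate_probability(preds, 0, "median"))  # median: 0.8
```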

Phase 3: Update Database Schema (Day 3)

3.1 Add New Feature Columns

model Track {
  // ... existing fields ...
  
  // ML-based mood predictions (Enhanced mode)
  moodHappy       Float?  // ML prediction 0-1
  moodSad         Float?  // ML prediction 0-1
  moodRelaxed     Float?  // ML prediction 0-1
  moodAggressive  Float?  // ML prediction 0-1
  danceabilityMl  Float?  // ML-based danceability
  
  // Analysis metadata
  analysisMode    String? // 'standard' or 'enhanced'
}

Phase 4: Update Vibe Matching Algorithm (Day 4)

4.1 Use Real Mood Predictions in Matching

// In library.ts - Enhanced vibe matching
const scored = analyzedTracks.map(t => {
    let score = 0;
    let factors = 0;
    
    // === MOOD MATCHING (50% total - the heart of vibe) ===
    
    // Happy mood (15%)
    if (sourceTrack.moodHappy !== null && t.moodHappy !== null) {
        score += (1 - Math.abs(sourceTrack.moodHappy - t.moodHappy)) * 0.15;
        factors += 0.15;
    }
    
    // Sad mood (10%)
    if (sourceTrack.moodSad !== null && t.moodSad !== null) {
        score += (1 - Math.abs(sourceTrack.moodSad - t.moodSad)) * 0.10;
        factors += 0.10;
    }
    
    // Relaxed mood (10%)
    if (sourceTrack.moodRelaxed !== null && t.moodRelaxed !== null) {
        score += (1 - Math.abs(sourceTrack.moodRelaxed - t.moodRelaxed)) * 0.10;
        factors += 0.10;
    }
    
    // Aggressive mood (10%)
    if (sourceTrack.moodAggressive !== null && t.moodAggressive !== null) {
        score += (1 - Math.abs(sourceTrack.moodAggressive - t.moodAggressive)) * 0.10;
        factors += 0.10;
    }
    
    // Valence - overall positivity (5%)
    if (sourceTrack.valence !== null && t.valence !== null) {
        score += (1 - Math.abs(sourceTrack.valence - t.valence)) * 0.05;
        factors += 0.05;
    }
    
    // === AUDIO CHARACTERISTICS (35% total) ===
    
    // BPM (15%) - linear falloff: full credit at 0 diff, zero credit at ±30 BPM
    if (sourceTrack.bpm && t.bpm) {
        const bpmDiff = Math.abs(sourceTrack.bpm - t.bpm);
        score += Math.max(0, 1 - bpmDiff / 30) * 0.15;
        factors += 0.15;
    }
    
    // Energy (10%)
    if (sourceTrack.energy !== null && t.energy !== null) {
        score += (1 - Math.abs(sourceTrack.energy - t.energy)) * 0.10;
        factors += 0.10;
    }
    
    // Danceability - prefer ML version (10%)
    const srcDance = sourceTrack.danceabilityMl ?? sourceTrack.danceability;
    const tDance = t.danceabilityMl ?? t.danceability;
    if (srcDance !== null && tDance !== null) {
        score += (1 - Math.abs(srcDance - tDance)) * 0.10;
        factors += 0.10;
    }
    
    // === GENRE/TAGS (15% total) ===
    
    // Genre/tag overlap (10%)
    const sourceGenres = [...(sourceTrack.lastfmTags || []), ...(sourceTrack.essentiaGenres || [])];
    const trackGenres = [...(t.lastfmTags || []), ...(t.essentiaGenres || [])];
    if (sourceGenres.length > 0 && trackGenres.length > 0) {
        const overlap = sourceGenres.filter(g => trackGenres.includes(g)).length;
        const maxOverlap = Math.max(sourceGenres.length, trackGenres.length);
        score += (overlap / maxOverlap) * 0.10;
        factors += 0.10;
    }
    
    // Key compatibility (5%)
    if (sourceTrack.keyScale && t.keyScale) {
        score += (sourceTrack.keyScale === t.keyScale ? 1 : 0.5) * 0.05;
        factors += 0.05;
    }
    
    const finalScore = factors > 0 ? score / factors : 0;
    return { id: t.id, score: finalScore };
});
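The key property of the algorithm above is the final `score / factors` normalization: features missing on either track are skipped and the score is renormalized by the weight actually compared, so a track analyzed in Standard mode is not penalized for lacking ML moods. A minimal Python sketch of that idea:

```python
from typing import List, Optional, Tuple

def weighted_similarity(
    pairs: List[Tuple[Optional[float], Optional[float], float]]
) -> float:
    """Each entry is (source_value, candidate_value, weight), values in [0, 1].

    Missing features are skipped; the result is renormalized by the
    weight that was actually compared.
    """
    score = 0.0
    factors = 0.0
    for src, cand, weight in pairs:
        if src is None or cand is None:
            continue  # feature missing on one side: skip, don't penalize
        score += (1 - abs(src - cand)) * weight
        factors += weight
    return score / factors if factors > 0 else 0.0

# The None-valued feature contributes nothing; only the matched one counts.
print(weighted_similarity([(0.9, 0.9, 0.15), (0.2, None, 0.10)]))  # 1.0
```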

Phase 5: Create Standard Mode Fallback (Day 5)

After Enhanced mode is working, implement Standard mode:

  • Same algorithm structure but skip ML features
  • Use estimated valence (improved heuristics)
  • Lower weights on mood matching since it's estimated
  • Higher weights on BPM, energy, genre tags
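One way the "improved heuristics" bullet could look: fold spectral centroid ("brightness") and danceability into the estimate alongside key and energy, since both correlate loosely with positivity. The blend weights below are illustrative guesses, not tuned values:

```python
def estimate_valence(scale: str, energy: float,
                     brightness: float, danceability: float) -> float:
    """Hypothetical Standard-mode valence heuristic.

    All inputs except scale are assumed pre-normalized to [0, 1];
    weights are illustrative, not tuned.
    """
    key_term = 0.6 if scale == "major" else 0.4
    raw = (0.25 * key_term + 0.30 * energy
           + 0.25 * brightness + 0.20 * danceability)
    return round(min(max(raw, 0.0), 1.0), 3)

print(estimate_valence("major", 0.7, 0.5, 0.6))  # 0.605
```

This is still a heuristic - it cannot catch "Fake Happy"-style emotional masking - which is exactly why Standard mode should weight mood lower than Enhanced mode does.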

Phase 6: Settings & UI (Day 6)

6.1 Add Settings Toggle

// System settings - Enhanced is DEFAULT
{
  audioAnalysis: {
    vibeMatchingMode: 'enhanced' | 'standard',  // Default: 'enhanced'
    reanalyzeOnModeChange: boolean,  // Default: false
  }
}

6.2 Settings UI

Audio Analysis
├── Vibe Matching Mode
│   ├── ● Enhanced (Recommended - Default)
│   │   └── Uses ML models for accurate mood detection
│   └── ○ Standard (Power Saver)
│       └── Faster, uses basic audio features only
│
├── Analysis Status
│   └── "1,234 / 1,500 tracks analyzed (Enhanced mode)"
│
└── [Re-analyze Library] button
    └── "Re-analyze all tracks with current settings"

Phase 7: Testing & Validation (Day 7)

7.1 Test Cases

| Source Track | Bad Match (Current) | Expected Good Match |
|---|---|---|
| "Fake Happy" (Paramore) | "Summer Girl" (Jamiroquai) at 97% | Other emo/pop-punk; the bad pair should drop below 60% |
| "Creep" (Radiohead) | Fast dance track | Other melancholic rock |
| "Uptown Funk" | Slow ballad | Other high-energy funk/pop |

7.2 Performance Testing

  • Analyze 100 tracks, measure time
  • Memory usage during analysis
  • Queue handling under load
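A minimal timing harness for the first bullet (hypothetical helper; `analyzer.analyze` is the method from Phase 2):

```python
import time
from typing import Callable, List, Tuple

def benchmark(analyze: Callable[[str], object],
              file_paths: List[str]) -> Tuple[float, float]:
    """Time a batch of analyses; returns (total_seconds, seconds_per_track)."""
    start = time.perf_counter()
    for path in file_paths:
        analyze(path)
    total = time.perf_counter() - start
    return total, total / max(len(file_paths), 1)

# Usage against the real analyzer would look like:
#   total, per_track = benchmark(analyzer.analyze, sample_paths)
```

Comparing `per_track` between modes against the benchmark table below (1-2 s Standard, 5-10 s Enhanced) validates the estimates on real hardware.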

Database Schema Updates

See Phase 3.1 above - the new columns (moodHappy, moodSad, moodRelaxed, moodAggressive, danceabilityMl, analysisMode) are added to the Track model in backend/prisma/schema.prisma.

Performance Benchmarks (Estimated)

| Operation | Standard Mode | Enhanced Mode |
|---|---|---|
| Analysis per track | 1-2 sec | 5-10 sec |
| RAM usage | ~100MB | ~500MB |
| Models in Docker | N/A | ~200MB (pre-packaged) |
| Vibe match query | <100ms | <100ms |
| Full library (1000 tracks) | ~30 min | ~2-3 hours |
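The full-library rows follow directly from the per-track rows; a quick arithmetic check:

```python
def library_hours(tracks: int, sec_per_track: float) -> float:
    """Estimated wall-clock hours to analyze a library serially."""
    return tracks * sec_per_track / 3600

# Upper bounds for 1000 tracks, matching the table:
print(round(library_hours(1000, 2) * 60))  # Standard: ~33 minutes
print(round(library_hours(1000, 10), 1))   # Enhanced: ~2.8 hours
```

Real numbers will be lower if the analysis queue runs multiple workers in parallel.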

Files to Modify

| File | Changes |
|---|---|
| services/audio-analyzer/Dockerfile | Add model downloads during build |
| services/audio-analyzer/analyzer.py | Implement ML model loading and prediction |
| backend/prisma/schema.prisma | Add mood prediction columns |
| backend/src/routes/library.ts | Update vibe matching algorithm weights |
| frontend/features/settings/ | Add analysis mode toggle (default: enhanced) |
| frontend/components/player/VibeGraph.tsx | Display mood predictions |

Success Metrics

After implementation, "Fake Happy" and "Summer Girl" should:

  • Match at <50% (different emotional content, different genre)

Better matches for "Fake Happy" would be:

  • Other Paramore songs (same artist = genre/production match)
  • Emo/pop-punk with similar emotional complexity
  • Songs with high energy but mixed emotional signals

Implementation Order (Enhanced First)

Week 1: Get Enhanced Mode Working

  1. Create implementation plan (this document)
  2. Update Dockerfile to pre-package ML models (~200MB)
  3. Rewrite analyzer.py with TensorFlow model loading
  4. Add new database columns for mood predictions (moodHappy, moodSad, etc.)
  5. Update vibe matching algorithm with ML mood weights
  6. Update programmatic playlists to use ML mood predictions
  7. Run Prisma migration to apply schema changes
  8. Rebuild audio-analyzer Docker container
  9. Test ML analysis on sample tracks

Week 2: Polish & Fallback

  1. Test accuracy with diverse track pairs
  2. Add settings UI (Enhanced = default)
  3. Implement Standard mode as explicit fallback option
  4. Update VibeGraph to show mood predictions
  5. Documentation and testing

Quick Reference: Models to Include

| Model | File | Purpose | Size |
|---|---|---|---|
| Embeddings | discogs-effnet-bs64-1.pb | Base model for all predictions | ~85MB |
| Happy | mood_happy-discogs-effnet-1.pb | Happiness detection | ~15MB |
| Sad | mood_sad-discogs-effnet-1.pb | Sadness detection | ~15MB |
| Relaxed | mood_relaxed-discogs-effnet-1.pb | Relaxation detection | ~15MB |
| Aggressive | mood_aggressive-discogs-effnet-1.pb | Aggression detection | ~15MB |
| Arousal | mood_arousal-discogs-effnet-1.pb | Energy/calm scale | ~15MB |
| Valence | mood_valence-discogs-effnet-1.pb | Positive/negative | ~15MB |
| Danceability | danceability-discogs-effnet-1.pb | ML danceability | ~15MB |
| Voice/Instrumental | voice_instrumental-discogs-effnet-1.pb | Vocal detection | ~15MB |

Total: ~200MB (one-time addition to Docker image)