Vibe Matching Implementation Plan
Executive Summary
The current vibe matching system uses Essentia for audio analysis but only extracts basic features. Critical mood/emotion features are either placeholder values or poorly estimated. This document outlines a comprehensive plan to achieve Spotify-quality vibe matching while being conscious of performance on user hardware.
Strategy Update (Latest)
Default: Enhanced mode (ML-powered, accurate)
Fallback: Standard mode (lightweight, for troubleshooting or power saving)
Approach:
- ✅ Pre-package all Essentia TensorFlow models in Docker image (~200MB)
- 🔄 Fix Enhanced mode FIRST - make it actually use the ML models
- ⏳ THEN create Standard mode as a lightweight fallback
- Users can toggle to Standard mode to save CPU if needed
Current State Analysis
What Essentia IS Currently Extracting (Working)
| Feature | Status | Quality |
|---|---|---|
| BPM | ✅ Working | Good - Uses RhythmExtractor2013 |
| Key | ✅ Working | Good - Uses KeyExtractor |
| KeyScale | ✅ Working | Good - major/minor detection |
| Energy | ✅ Working | Moderate - Raw energy normalized |
| Loudness | ✅ Working | Good - dB measurement |
| Dynamic Range | ✅ Working | Good |
| Danceability | ✅ Working | Good - Uses Danceability algorithm |
| Beats Count | ✅ Working | Good |
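For context, these working features map to plain (non-TensorFlow) Essentia calls. A minimal sketch of that extraction, with a placeholder file path; the real analyzer.py may differ in detail:

```python
# Minimal sketch of the current working extraction (plain Essentia, no TF models).
from essentia.standard import (
    MonoLoader, RhythmExtractor2013, KeyExtractor, Danceability, Energy
)

audio = MonoLoader(filename='/music/track.flac', sampleRate=44100)()  # placeholder path

bpm, beats, beats_conf, _, intervals = RhythmExtractor2013(method='multifeature')(audio)
key, scale, key_strength = KeyExtractor()(audio)
danceability, dfa = Danceability()(audio)
energy = Energy()(audio)   # raw energy, normalized downstream
beats_count = len(beats)
```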
What's Broken or Placeholder
| Feature | Status | Problem |
|---|---|---|
| Valence | ⚠️ Fake | Calculated as (major/minor * 0.4) + (energy * 0.6) - NOT actual emotional valence |
| Arousal | ⚠️ Fake | Calculated as (BPM * 0.5) + (energy * 0.5) - NOT actual arousal |
| Instrumentalness | ❌ Placeholder | Hardcoded to 0.5 |
| Acousticness | ⚠️ Estimate | Rough estimate from dynamic range |
| Speechiness | ❌ Placeholder | Hardcoded to 0.1 |
| Mood Tags | ⚠️ Derived | Generated from fake valence/arousal, not ML |
| Genre Tags | ❌ Empty | TensorFlow models not loaded |
The Core Issue
```python
# Current valence calculation (analyzer.py lines 226-231)
key_valence = 0.6 if scale == 'major' else 0.4
energy_valence = result['energy']
result['valence'] = round((key_valence * 0.4 + energy_valence * 0.6), 3)
```
"Fake Happy" by Paramore (emotionally complex, about masking sadness):
- Major key → 0.6
- High energy → ~0.7
- Calculated valence: (0.6 * 0.4) + (0.7 * 0.6) = 0.66 (appears "happy")
"Summer Girl" by Jamiroquai (genuinely upbeat funk):
- Major key → 0.6
- High energy → ~0.7
- Calculated valence: (0.6 * 0.4) + (0.7 * 0.6) = 0.66 (appears "happy")
Result: a 97% match, despite the two tracks having completely different vibes!
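The collapse is easy to reproduce; the snippet below simply re-runs the formula above for both tracks:

```python
# The heuristic maps any major-key, high-energy track to the same valence.
def heuristic_valence(scale: str, energy: float) -> float:
    key_valence = 0.6 if scale == 'major' else 0.4
    return round(key_valence * 0.4 + energy * 0.6, 3)

print(heuristic_valence('major', 0.7))  # "Fake Happy"  -> 0.66
print(heuristic_valence('major', 0.7))  # "Summer Girl" -> 0.66 (indistinguishable)
```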
How Spotify Does It
Spotify's audio analysis uses a combination of:
1. Low-Level Audio Features (Similar to what we have)
- Tempo/BPM
- Key/Mode
- Loudness
- Time signature
2. Mid-Level Features (We're missing these)
- Spectral Centroid - "brightness" of the sound
- Spectral Rolloff - frequency distribution
- Zero Crossing Rate - percussiveness
- MFCCs - Mel-frequency cepstral coefficients (timbral texture)
- Chroma Features - harmonic content
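These mid-level features are cheap to add with plain Essentia. A hedged sketch (frame sizes and the frame-averaging strategy are illustrative choices, not a spec; chroma via HPCP is omitted for brevity):

```python
# Sketch: computing the missing mid-level features with plain Essentia.
import numpy as np
from essentia.standard import (
    MonoLoader, FrameGenerator, Windowing, Spectrum, MFCC,
    Centroid, RollOff, ZeroCrossingRate
)

sr = 44100
audio = MonoLoader(filename='/music/track.flac', sampleRate=sr)()  # placeholder path

window = Windowing(type='hann')
spectrum = Spectrum()
mfcc = MFCC(numberCoefficients=13)
centroid = Centroid(range=sr / 2)   # scale bin position to Hz
rolloff = RollOff(sampleRate=sr)
zcr = ZeroCrossingRate()

centroids, rolloffs, zcrs, mfccs = [], [], [], []
for frame in FrameGenerator(audio, frameSize=2048, hopSize=1024):
    spec = spectrum(window(frame))
    centroids.append(centroid(spec))   # "brightness"
    rolloffs.append(rolloff(spec))     # frequency distribution
    zcrs.append(zcr(frame))            # percussiveness
    _, coeffs = mfcc(spec)
    mfccs.append(coeffs)               # timbral texture

features = {
    'spectralCentroid': float(np.mean(centroids)),
    'spectralRolloff': float(np.mean(rolloffs)),
    'zeroCrossingRate': float(np.mean(zcrs)),
    'mfcc': np.mean(mfccs, axis=0).tolist(),  # 13 averaged coefficients
}
```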
3. High-Level Features (We're faking these)
- Valence - Musical positiveness (0-1)
- Arousal/Energy - Intensity and activity
- Instrumentalness - Vocal presence prediction
- Acousticness - Acoustic vs electronic
- Speechiness - Presence of spoken words
- Liveness - Audience presence detection
4. Deep Learning Models
Spotify trains neural networks on millions of labeled tracks to predict:
- Mood categories
- Genre classification
- User preference patterns
Two-Tier System
Default: Enhanced Vibe Matching (ML-Powered)
Status: DEFAULT - Pre-packaged in Docker, just works
Target: High accuracy, ~5-10 seconds per track
Features (from Essentia TensorFlow Models):
- Mood Predictions (real ML, not estimated):
  - `mood_happy-discogs-effnet-1.pb` - Happiness/positivity 0-1
  - `mood_sad-discogs-effnet-1.pb` - Sadness 0-1
  - `mood_relaxed-discogs-effnet-1.pb` - Relaxation/calmness 0-1
  - `mood_aggressive-discogs-effnet-1.pb` - Aggression/intensity 0-1
- Audio Characteristics:
  - `danceability-discogs-effnet-1.pb` - ML-based danceability
  - `voice_instrumental-discogs-effnet-1.pb` - Vocal detection (instrumentalness)
- Embeddings for Similarity:
  - `discogs-effnet-bs64-1.pb` - Audio embeddings (neural "fingerprint")
  - Can be used for direct similarity comparison
- Spectral Features:
  - Spectral Centroid (brightness)
  - MFCCs (timbral texture - 13 coefficients)
Models Pre-packaged: ~200MB in Docker image (no user download)
RAM Requirement: ~500MB during analysis
CPU Requirement: Any modern CPU (2015+)
Fallback: Standard Vibe Matching (Lightweight)
Status: FALLBACK - For troubleshooting or power saving
Target: Fast, <2 seconds per track, low CPU
Features Used:
- BPM (Essentia RhythmExtractor)
- Energy (Essentia Energy)
- Danceability (Essentia Danceability - non-ML version)
- Key/Scale (Essentia KeyExtractor)
- Spectral Centroid (cheap to compute)
- Last.fm mood tags
- Genre matching from tags
When to use Standard mode:
- Low-power devices (Raspberry Pi, older NAS)
- Troubleshooting if Enhanced mode has issues
- User preference to save CPU cycles
Implementation Plan
Phase 1: Pre-Package Models in Docker (Day 1)
1.1 Update Dockerfile to Include Models
```dockerfile
# Download Essentia ML models during build (~200MB)
RUN apt-get update && apt-get install -y --no-install-recommends curl && \
    mkdir -p /app/models && \
# Base embedding model (required for all predictions)
curl -L -o /app/models/discogs-effnet-bs64-1.pb \
"https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb" && \
# Mood models
curl -L -o /app/models/mood_happy-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_happy/mood_happy-discogs-effnet-1.pb" && \
curl -L -o /app/models/mood_sad-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_sad/mood_sad-discogs-effnet-1.pb" && \
curl -L -o /app/models/mood_relaxed-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_relaxed/mood_relaxed-discogs-effnet-1.pb" && \
curl -L -o /app/models/mood_aggressive-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_aggressive/mood_aggressive-discogs-effnet-1.pb" && \
# Danceability and voice/instrumental
curl -L -o /app/models/danceability-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/danceability/danceability-discogs-effnet-1.pb" && \
curl -L -o /app/models/voice_instrumental-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/voice_instrumental/voice_instrumental-discogs-effnet-1.pb" && \
# Arousal/Valence models
curl -L -o /app/models/arousal-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_arousal/mood_arousal-discogs-effnet-1.pb" && \
curl -L -o /app/models/valence-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_valence/mood_valence-discogs-effnet-1.pb" && \
    apt-get purge -y curl && rm -rf /var/lib/apt/lists/*
```
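To catch a broken image early, a startup check along these lines could run before the analyzer accepts work (paths mirror the Dockerfile above; the function name is ours):

```python
# Sketch: verify all pre-packaged models made it into the image.
import os

EXPECTED_MODELS = [
    'discogs-effnet-bs64-1.pb',
    'mood_happy-discogs-effnet-1.pb',
    'mood_sad-discogs-effnet-1.pb',
    'mood_relaxed-discogs-effnet-1.pb',
    'mood_aggressive-discogs-effnet-1.pb',
    'danceability-discogs-effnet-1.pb',
    'voice_instrumental-discogs-effnet-1.pb',
    'arousal-discogs-effnet-1.pb',
    'valence-discogs-effnet-1.pb',
]

def verify_models(model_dir: str = '/app/models') -> list[str]:
    """Return the list of missing model files (empty list = all present)."""
    return [m for m in EXPECTED_MODELS
            if not os.path.exists(os.path.join(model_dir, m))]
```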
Phase 2: Implement Enhanced Analysis (Days 2-4)
2.1 Rewrite analyzer.py with ML Models
```python
import logging
import os
from typing import Any, Dict

import numpy as np

logger = logging.getLogger(__name__)

try:
    import essentia.standard  # noqa: F401 - availability check only
    ESSENTIA_AVAILABLE = True
except ImportError:
    ESSENTIA_AVAILABLE = False


class AudioAnalyzer:
    """Enhanced audio analysis using Essentia TensorFlow models."""

    def __init__(self):
        self.models_loaded = False
        self.embedding_model = None
        self.mood_models = {}
        if ESSENTIA_AVAILABLE:
            self._init_essentia()
            self._load_ml_models()

    def load_audio(self, file_path: str, sample_rate: int = 44100):
        """Load a mono signal at the given rate (the ML models expect 16 kHz)."""
        try:
            from essentia.standard import MonoLoader
            return MonoLoader(filename=file_path, sampleRate=sample_rate)()
        except Exception as e:
            logger.error(f"Failed to load audio {file_path}: {e}")
            return None

    def _load_ml_models(self):
        """Load TensorFlow models for enhanced analysis."""
        try:
            from essentia.standard import (
                TensorflowPredictEffnetDiscogs,
                TensorflowPredict2D,
            )

            # Load embedding extractor (base for all predictions)
            embedding_path = '/app/models/discogs-effnet-bs64-1.pb'
            if os.path.exists(embedding_path):
                self.embedding_model = TensorflowPredictEffnetDiscogs(
                    graphFilename=embedding_path,
                    output="PartitionedCall:1"
                )
                logger.info("Loaded embedding model")

            # Load mood prediction models
            mood_model_paths = {
                'happy': '/app/models/mood_happy-discogs-effnet-1.pb',
                'sad': '/app/models/mood_sad-discogs-effnet-1.pb',
                'relaxed': '/app/models/mood_relaxed-discogs-effnet-1.pb',
                'aggressive': '/app/models/mood_aggressive-discogs-effnet-1.pb',
                'danceability': '/app/models/danceability-discogs-effnet-1.pb',
                'voice_instrumental': '/app/models/voice_instrumental-discogs-effnet-1.pb',
                'arousal': '/app/models/arousal-discogs-effnet-1.pb',
                'valence': '/app/models/valence-discogs-effnet-1.pb',
            }
            for name, path in mood_model_paths.items():
                if os.path.exists(path):
                    self.mood_models[name] = TensorflowPredict2D(
                        graphFilename=path,
                        output="model/Softmax"
                    )
                    logger.info(f"Loaded {name} model")

            # Classification heads are useless without the embedding model
            self.models_loaded = (self.embedding_model is not None
                                  and len(self.mood_models) > 0)
            logger.info(f"ML models loaded: {self.models_loaded} "
                        f"({len(self.mood_models)} models)")
        except Exception as e:
            logger.warning(f"Could not load ML models: {e}")
            self.models_loaded = False

    def analyze(self, file_path: str) -> Dict[str, Any]:
        """Full analysis with ML models if available."""
        result = self._extract_basic_features(file_path)
        if self.models_loaded:
            ml_features = self._extract_ml_features(file_path)
            result.update(ml_features)
            result['analysisMode'] = 'enhanced'
        else:
            # Fall back to estimated values
            result.update(self._estimate_mood_features(result))
            result['analysisMode'] = 'standard'
        return result

    def _extract_ml_features(self, file_path: str) -> Dict[str, Any]:
        """Extract features using the TensorFlow models."""
        result = {}

        # Load audio at 16 kHz, as the Discogs-EffNet models require
        audio = self.load_audio(file_path, sample_rate=16000)
        if audio is None:
            return result

        # Get embeddings (patches x embedding size)
        embeddings = self.embedding_model(audio)

        # Mood predictions: column 1 is the positive-class probability
        # ("happy", "sad", ...), averaged over all patches
        for name, key in [('happy', 'moodHappy'), ('sad', 'moodSad'),
                          ('relaxed', 'moodRelaxed'), ('aggressive', 'moodAggressive')]:
            if name in self.mood_models:
                preds = self.mood_models[name](embeddings)
                result[key] = float(np.mean(preds[:, 1]))

        # Real valence and arousal from dedicated models
        if 'valence' in self.mood_models:
            preds = self.mood_models['valence'](embeddings)
            result['valence'] = float(np.mean(preds[:, 1]))
        if 'arousal' in self.mood_models:
            preds = self.mood_models['arousal'](embeddings)
            result['arousal'] = float(np.mean(preds[:, 1]))

        # Instrumentalness from the voice/instrumental model
        if 'voice_instrumental' in self.mood_models:
            preds = self.mood_models['voice_instrumental'](embeddings)
            result['instrumentalness'] = float(np.mean(preds[:, 1]))  # 1 = instrumental

        # ML-based danceability
        if 'danceability' in self.mood_models:
            preds = self.mood_models['danceability'](embeddings)
            result['danceabilityMl'] = float(np.mean(preds[:, 1]))

        return result

    # _init_essentia, _extract_basic_features and _estimate_mood_features
    # keep their existing implementations from the current analyzer.py.
```
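Expected usage, with a hypothetical file path:

```python
analyzer = AudioAnalyzer()
result = analyzer.analyze('/music/Paramore/Fake Happy.flac')  # hypothetical path
print(result['analysisMode'])                    # 'enhanced' if the models loaded
print(result.get('valence'), result.get('moodHappy'))
```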
Phase 3: Update Database Schema (Day 3)
3.1 Add New Feature Columns
```prisma
model Track {
  // ... existing fields ...

  // ML-based mood predictions (Enhanced mode)
  moodHappy        Float?   // ML prediction 0-1
  moodSad          Float?   // ML prediction 0-1
  moodRelaxed      Float?   // ML prediction 0-1
  moodAggressive   Float?   // ML prediction 0-1
  danceabilityMl   Float?   // ML-based danceability

  // Analysis metadata
  analysisMode     String?  // 'standard' or 'enhanced'
}
```
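With the columns in place, Week 1's "Run Prisma migration" step corresponds to the standard Prisma workflow: `npx prisma migrate dev --name add_mood_predictions` in development (migration name illustrative), then `npx prisma migrate deploy` when the backend container starts against production data.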
Phase 4: Update Vibe Matching Algorithm (Day 4)
4.1 Use Real Mood Predictions in Matching
```typescript
// In library.ts - Enhanced vibe matching
const scored = analyzedTracks.map(t => {
  let score = 0;
  let factors = 0;

  // === MOOD MATCHING (50% total - the heart of vibe) ===

  // Happy mood (15%)
  if (sourceTrack.moodHappy !== null && t.moodHappy !== null) {
    score += (1 - Math.abs(sourceTrack.moodHappy - t.moodHappy)) * 0.15;
    factors += 0.15;
  }

  // Sad mood (10%)
  if (sourceTrack.moodSad !== null && t.moodSad !== null) {
    score += (1 - Math.abs(sourceTrack.moodSad - t.moodSad)) * 0.10;
    factors += 0.10;
  }

  // Relaxed mood (10%)
  if (sourceTrack.moodRelaxed !== null && t.moodRelaxed !== null) {
    score += (1 - Math.abs(sourceTrack.moodRelaxed - t.moodRelaxed)) * 0.10;
    factors += 0.10;
  }

  // Aggressive mood (10%)
  if (sourceTrack.moodAggressive !== null && t.moodAggressive !== null) {
    score += (1 - Math.abs(sourceTrack.moodAggressive - t.moodAggressive)) * 0.10;
    factors += 0.10;
  }

  // Valence - overall positivity (5%)
  if (sourceTrack.valence !== null && t.valence !== null) {
    score += (1 - Math.abs(sourceTrack.valence - t.valence)) * 0.05;
    factors += 0.05;
  }

  // === AUDIO CHARACTERISTICS (35% total) ===

  // BPM (15%) - within ±15 BPM scores well, zero beyond ±30
  if (sourceTrack.bpm && t.bpm) {
    const bpmDiff = Math.abs(sourceTrack.bpm - t.bpm);
    score += Math.max(0, 1 - bpmDiff / 30) * 0.15;
    factors += 0.15;
  }

  // Energy (10%)
  if (sourceTrack.energy !== null && t.energy !== null) {
    score += (1 - Math.abs(sourceTrack.energy - t.energy)) * 0.10;
    factors += 0.10;
  }

  // Danceability - prefer ML version (10%)
  const srcDance = sourceTrack.danceabilityMl ?? sourceTrack.danceability;
  const tDance = t.danceabilityMl ?? t.danceability;
  if (srcDance !== null && tDance !== null) {
    score += (1 - Math.abs(srcDance - tDance)) * 0.10;
    factors += 0.10;
  }

  // === GENRE/TAGS (15% total) ===

  // Genre/tag overlap (10%)
  const sourceGenres = [...(sourceTrack.lastfmTags || []), ...(sourceTrack.essentiaGenres || [])];
  const trackGenres = [...(t.lastfmTags || []), ...(t.essentiaGenres || [])];
  if (sourceGenres.length > 0 && trackGenres.length > 0) {
    const overlap = sourceGenres.filter(g => trackGenres.includes(g)).length;
    const maxOverlap = Math.max(sourceGenres.length, trackGenres.length);
    score += (overlap / maxOverlap) * 0.10;
    factors += 0.10;
  }

  // Key compatibility (5%)
  if (sourceTrack.keyScale && t.keyScale) {
    score += (sourceTrack.keyScale === t.keyScale ? 1 : 0.5) * 0.05;
    factors += 0.05;
  }

  const finalScore = factors > 0 ? score / factors : 0;
  return { id: t.id, score: finalScore };
});
```
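A note on the `score / factors` normalization at the end: it rescales the score by the weight actually available, so a pair of tracks missing some features (e.g., not yet re-analyzed in Enhanced mode) is compared on the features both tracks have, rather than being silently penalized for nulls.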
Phase 5: Create Standard Mode Fallback (Day 5)
After Enhanced mode is working, implement Standard mode:
- Same algorithm structure but skip ML features
- Use estimated valence (improved heuristics)
- Lower weights on mood matching since it's estimated
- Higher weights on BPM, energy, genre tags
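As one illustration of "improved heuristics" (the blend and constants here are assumptions to be tuned, not a spec), Standard mode's estimated valence could fold in brightness and tempo alongside key and energy:

```python
# Sketch: an improved Standard-mode valence heuristic (illustrative only).
# Adding spectral brightness and tempo context lets dark-but-major or
# bright-but-minor tracks separate better than the key/energy blend alone.
def estimate_valence(scale: str, energy: float, bpm: float,
                     spectral_centroid: float) -> float:
    key_component = 0.65 if scale == 'major' else 0.35
    # Brightness: normalize centroid against a ~4 kHz "bright" reference
    brightness = min(spectral_centroid / 4000.0, 1.0)
    # Tempo: mid tempos (~120 BPM) read as more positive than extremes
    tempo_component = max(0.0, 1.0 - abs(bpm - 120.0) / 120.0)
    value = (0.3 * key_component + 0.3 * energy
             + 0.25 * brightness + 0.15 * tempo_component)
    return round(min(max(value, 0.0), 1.0), 3)
```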
Phase 6: Settings & UI (Day 6)
6.1 Add Settings Toggle
```typescript
// System settings - Enhanced is DEFAULT
{
  audioAnalysis: {
    vibeMatchingMode: 'enhanced' | 'standard',  // Default: 'enhanced'
    reanalyzeOnModeChange: boolean,             // Default: false
  }
}
```
6.2 Settings UI
```
Audio Analysis
├── Vibe Matching Mode
│   ├── ● Enhanced (Recommended - Default)
│   │   └── Uses ML models for accurate mood detection
│   └── ○ Standard (Power Saver)
│       └── Faster, uses basic audio features only
│
├── Analysis Status
│   └── "1,234 / 1,500 tracks analyzed (Enhanced mode)"
│
└── [Re-analyze Library] button
    └── "Re-analyze all tracks with current settings"
```
Phase 7: Testing & Validation (Day 7)
7.1 Test Cases
| Source Track | Bad Match (Current) | Expected Good Match |
|---|---|---|
| "Fake Happy" (Paramore) | "Summer Girl" (Jamiroquai) 97% | Other emo/pop-punk <60% |
| "Creep" (Radiohead) | Fast dance track | Other melancholic rock |
| "Uptown Funk" | Slow ballad | Other high-energy funk/pop |
7.2 Performance Testing
- Analyze 100 tracks, measure time
- Memory usage during analysis
- Queue handling under load
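A minimal timing harness for 7.2, assuming the AudioAnalyzer class from Phase 2 and a /music library path:

```python
# Sketch: time Enhanced analysis over a sample of tracks.
import time
from pathlib import Path

analyzer = AudioAnalyzer()
tracks = sorted(Path('/music').rglob('*.flac'))[:100]  # sample path assumed

start = time.perf_counter()
for track in tracks:
    analyzer.analyze(str(track))
elapsed = time.perf_counter() - start
print(f"{len(tracks)} tracks in {elapsed:.1f}s "
      f"({elapsed / max(len(tracks), 1):.1f}s per track)")
```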
Performance Benchmarks (Estimated)
| Operation | Standard Mode | Enhanced Mode |
|---|---|---|
| Analysis per track | 1-2 sec | 5-10 sec |
| RAM usage | ~100MB | ~500MB |
| Models in Docker | N/A | ~200MB (pre-packaged) |
| Vibe match query | <100ms | <100ms |
| Full library (1000 tracks) | ~30 min | ~2-3 hours |
Files to Modify
| File | Changes |
|---|---|
| `services/audio-analyzer/Dockerfile` | Add model downloads during build |
| `services/audio-analyzer/analyzer.py` | Implement ML model loading and prediction |
| `backend/prisma/schema.prisma` | Add mood prediction columns |
| `backend/src/routes/library.ts` | Update vibe matching algorithm weights |
| `frontend/features/settings/` | Add analysis mode toggle (default: enhanced) |
| `frontend/components/player/VibeGraph.tsx` | Display mood predictions |
Success Metrics
After implementation, "Fake Happy" and "Summer Girl" should:
- Match at <50% (different emotional content, different genre)
Better matches for "Fake Happy" would be:
- Other Paramore songs (same artist = genre/production match)
- Emo/pop-punk with similar emotional complexity
- Songs with high energy but mixed emotional signals
Implementation Order (Enhanced First)
Week 1: Get Enhanced Mode Working
- Create implementation plan (this document)
- Update Dockerfile to pre-package ML models (~200MB)
- Rewrite analyzer.py with TensorFlow model loading
- Add new database columns for mood predictions (moodHappy, moodSad, etc.)
- Update vibe matching algorithm with ML mood weights
- Update programmatic playlists to use ML mood predictions
- Run Prisma migration to apply schema changes
- Rebuild audio-analyzer Docker container
- Test ML analysis on sample tracks
Week 2: Polish & Fallback
- Test accuracy with diverse track pairs
- Add settings UI (Enhanced = default)
- Implement Standard mode as explicit fallback option
- Update VibeGraph to show mood predictions
- Documentation and testing
Quick Reference: Models to Include
| Model | File | Purpose | Size |
|---|---|---|---|
| Embeddings | `discogs-effnet-bs64-1.pb` | Base model for all predictions | ~85MB |
| Happy | `mood_happy-discogs-effnet-1.pb` | Happiness detection | ~15MB |
| Sad | `mood_sad-discogs-effnet-1.pb` | Sadness detection | ~15MB |
| Relaxed | `mood_relaxed-discogs-effnet-1.pb` | Relaxation detection | ~15MB |
| Aggressive | `mood_aggressive-discogs-effnet-1.pb` | Aggression detection | ~15MB |
| Arousal | `mood_arousal-discogs-effnet-1.pb` | Energy/calm scale | ~15MB |
| Valence | `mood_valence-discogs-effnet-1.pb` | Positive/negative | ~15MB |
| Danceability | `danceability-discogs-effnet-1.pb` | ML danceability | ~15MB |
| Voice/Instrumental | `voice_instrumental-discogs-effnet-1.pb` | Vocal detection | ~15MB |
Total: ~200MB (one-time addition to Docker image)