# Vibe Matching Implementation Plan
## Executive Summary
The current vibe matching system uses Essentia for audio analysis but only extracts **basic features**. Critical mood/emotion features are either placeholder values or poorly estimated. This document outlines a comprehensive plan to achieve Spotify-quality vibe matching while being conscious of performance on user hardware.
## Strategy Update (Latest)
**Default:** Enhanced mode (ML-powered, accurate)
**Fallback:** Standard mode (lightweight, for troubleshooting or power saving)
**Approach:**
1. ✅ Pre-package all Essentia TensorFlow models in Docker image (~200MB)
2. 🔄 Fix Enhanced mode FIRST - make it actually use the ML models
3. ⏳ THEN create Standard mode as a lightweight fallback
4. Users can toggle to Standard mode to save CPU if needed
---
## Current State Analysis
### What Essentia IS Currently Extracting (Working)
| Feature | Status | Quality |
|---------|--------|---------|
| **BPM** | ✅ Working | Good - Uses `RhythmExtractor2013` |
| **Key** | ✅ Working | Good - Uses `KeyExtractor` |
| **KeyScale** | ✅ Working | Good - major/minor detection |
| **Energy** | ✅ Working | Moderate - Raw energy normalized |
| **Loudness** | ✅ Working | Good - dB measurement |
| **Dynamic Range** | ✅ Working | Good |
| **Danceability** | ✅ Working | Good - Uses `Danceability` algorithm |
| **Beats Count** | ✅ Working | Good |
### What's Broken or Placeholder
| Feature | Status | Problem |
|---------|--------|---------|
| **Valence** | ⚠️ Fake | Calculated as `(major/minor * 0.4) + (energy * 0.6)` - NOT actual emotional valence |
| **Arousal** | ⚠️ Fake | Calculated as `(BPM * 0.5) + (energy * 0.5)` - NOT actual arousal |
| **Instrumentalness** | ❌ Placeholder | Hardcoded to `0.5` |
| **Acousticness** | ⚠️ Estimate | Rough estimate from dynamic range |
| **Speechiness** | ❌ Placeholder | Hardcoded to `0.1` |
| **Mood Tags** | ⚠️ Derived | Generated from fake valence/arousal, not ML |
| **Genre Tags** | ❌ Empty | TensorFlow models not loaded |
### The Core Issue
```python
# Current valence calculation (analyzer.py lines 226-231)
key_valence = 0.6 if scale == 'major' else 0.4
energy_valence = result['energy']
result['valence'] = round((key_valence * 0.4 + energy_valence * 0.6), 3)
```
**"Fake Happy" by Paramore** (emotionally complex, about masking sadness):
- Major key → 0.6
- High energy → ~0.7
- Calculated valence: `(0.6 * 0.4) + (0.7 * 0.6) = 0.66` (appears "happy")
**"Summer Girl" by Jamiroquai** (genuinely upbeat funk):
- Major key → 0.6
- High energy → ~0.7
- Calculated valence: `(0.6 * 0.4) + (0.7 * 0.6) = 0.66` (appears "happy")
**Result: a 97% match despite completely different vibes!**
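The collapse is easy to reproduce directly from the heuristic (a minimal sketch; the weights are copied from the `analyzer.py` snippet above, the energy values are the rough estimates used in this example):

```python
def heuristic_valence(scale: str, energy: float) -> float:
    """The current estimate: key contributes 40%, energy 60%."""
    key_valence = 0.6 if scale == 'major' else 0.4
    return round(key_valence * 0.4 + energy * 0.6, 3)

# Two emotionally different tracks with similar surface features
# collapse to the same score:
fake_happy = heuristic_valence('major', 0.7)   # Paramore - "Fake Happy"
summer_girl = heuristic_valence('major', 0.7)  # Jamiroquai - "Summer Girl"
print(fake_happy, summer_girl)  # 0.66 0.66 - indistinguishable
```

Any two major-key, high-energy tracks land on the same valence no matter what the music actually expresses, which is exactly why these two songs score a 97% match.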
---
## How Spotify Does It
Spotify's audio analysis uses a combination of:
### 1. Low-Level Audio Features (Similar to what we have)
- Tempo/BPM
- Key/Mode
- Loudness
- Time signature
### 2. Mid-Level Features (We're missing these)
- **Spectral Centroid** - "brightness" of the sound
- **Spectral Rolloff** - frequency distribution
- **Zero Crossing Rate** - percussiveness
- **MFCCs** - Mel-frequency cepstral coefficients (timbral texture)
- **Chroma Features** - harmonic content
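Two of these mid-level features are cheap enough to compute with a plain FFT; the following NumPy-only sketch on a synthetic tone is illustrative, not the Essentia implementation:

```python
import numpy as np

def spectral_centroid(signal: np.ndarray, sr: int) -> float:
    """Magnitude-weighted mean frequency - the "brightness" of the sound."""
    mags = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(np.sum(freqs * mags) / np.sum(mags))

def zero_crossing_rate(signal: np.ndarray) -> float:
    """Fraction of adjacent samples with a sign change (percussiveness cue)."""
    return float(np.mean(np.abs(np.diff(np.sign(signal))) > 0))

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)          # pure 440 Hz sine, one second
print(round(spectral_centroid(tone, sr)))   # ~440 - centroid tracks the pitch
```

MFCCs and chroma need a mel/chroma filterbank on top of the same short-time spectrum, which is where a library like Essentia earns its keep.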
### 3. High-Level Features (We're faking these)
- **Valence** - Musical positiveness (0-1)
- **Arousal/Energy** - Intensity and activity
- **Instrumentalness** - Vocal presence prediction
- **Acousticness** - Acoustic vs electronic
- **Speechiness** - Presence of spoken words
- **Liveness** - Audience presence detection
### 4. Deep Learning Models
Spotify trains neural networks on millions of labeled tracks to predict:
- Mood categories
- Genre classification
- User preference patterns
---
## Two-Tier System
### Default: Enhanced Vibe Matching (ML-Powered)
**Status:** DEFAULT - Pre-packaged in Docker, just works
**Target:** High accuracy, ~5-10 seconds per track
**Features (from Essentia TensorFlow Models):**
1. **Mood Predictions (real ML, not estimated):**
- `mood_happy-discogs-effnet-1.pb` - Happiness/positivity 0-1
- `mood_sad-discogs-effnet-1.pb` - Sadness 0-1
- `mood_relaxed-discogs-effnet-1.pb` - Relaxation/calmness 0-1
- `mood_aggressive-discogs-effnet-1.pb` - Aggression/intensity 0-1
2. **Audio Characteristics:**
- `danceability-discogs-effnet-1.pb` - ML-based danceability
- `voice_instrumental-discogs-effnet-1.pb` - Vocal detection (instrumentalness)
3. **Embeddings for Similarity:**
- `discogs-effnet-bs64-1.pb` - Audio embeddings (neural "fingerprint")
- Can be used for direct similarity comparison
4. **Spectral Features:**
- Spectral Centroid (brightness)
- MFCCs (timbral texture - 13 coefficients)
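The embedding similarity in item 3 reduces to a cosine distance between per-track vectors. A minimal NumPy sketch, where the patch count, dimensionality, and mean-pooling are assumptions for illustration (the effnet model emits one embedding per audio patch):

```python
import numpy as np

def track_embedding(patch_embeddings: np.ndarray) -> np.ndarray:
    """Mean-pool per-patch embeddings into one vector per track."""
    return patch_embeddings.mean(axis=0)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
track_a = track_embedding(rng.normal(size=(12, 128)))  # 12 patches, 128-dim (assumed)
track_b = track_embedding(rng.normal(size=(12, 128)))
print(round(cosine_similarity(track_a, track_a), 6))   # 1.0 - identical tracks
```

Pooled embeddings could be stored alongside the scalar features and compared directly, complementing the weighted feature matching described in Phase 4.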
**Models Pre-packaged:** ~200MB in Docker image (no user download)
**RAM Requirement:** ~500MB during analysis
**CPU Requirement:** Any modern CPU (2015+)
### Fallback: Standard Vibe Matching (Lightweight)
**Status:** FALLBACK - For troubleshooting or power saving
**Target:** Fast, <2 seconds per track, low CPU
**Features Used:**
- BPM (Essentia RhythmExtractor)
- Energy (Essentia Energy)
- Danceability (Essentia Danceability - non-ML version)
- Key/Scale (Essentia KeyExtractor)
- Spectral Centroid (cheap to compute)
- Last.fm mood tags
- Genre matching from tags
**When to use Standard mode:**
- Low-power devices (Raspberry Pi, older NAS)
- Troubleshooting if Enhanced mode has issues
- User preference to save CPU cycles
---
## Implementation Plan
### Phase 1: Pre-Package Models in Docker (Day 1)
#### 1.1 Update Dockerfile to Include Models
```dockerfile
# Download Essentia ML models during build (~200MB)
RUN apt-get update && apt-get install -y --no-install-recommends curl && \
mkdir -p /app/models && \
# Base embedding model (required for all predictions)
curl -L -o /app/models/discogs-effnet-bs64-1.pb \
"https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb" && \
# Mood models
curl -L -o /app/models/mood_happy-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_happy/mood_happy-discogs-effnet-1.pb" && \
curl -L -o /app/models/mood_sad-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_sad/mood_sad-discogs-effnet-1.pb" && \
curl -L -o /app/models/mood_relaxed-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_relaxed/mood_relaxed-discogs-effnet-1.pb" && \
curl -L -o /app/models/mood_aggressive-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_aggressive/mood_aggressive-discogs-effnet-1.pb" && \
# Danceability and voice/instrumental
curl -L -o /app/models/danceability-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/danceability/danceability-discogs-effnet-1.pb" && \
curl -L -o /app/models/voice_instrumental-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/voice_instrumental/voice_instrumental-discogs-effnet-1.pb" && \
# Arousal/Valence models
curl -L -o /app/models/arousal-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_arousal/mood_arousal-discogs-effnet-1.pb" && \
curl -L -o /app/models/valence-discogs-effnet-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_valence/mood_valence-discogs-effnet-1.pb" && \
apt-get purge -y curl && rm -rf /var/lib/apt/lists/*
```
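Because Enhanced mode silently degrades when a model file is missing, a small startup check is worth adding. This is a sketch, not existing code; the paths mirror the local filenames used in the Dockerfile above:

```python
from pathlib import Path

EXPECTED_MODELS = [
    "discogs-effnet-bs64-1.pb",
    "mood_happy-discogs-effnet-1.pb",
    "mood_sad-discogs-effnet-1.pb",
    "mood_relaxed-discogs-effnet-1.pb",
    "mood_aggressive-discogs-effnet-1.pb",
    "danceability-discogs-effnet-1.pb",
    "voice_instrumental-discogs-effnet-1.pb",
    "arousal-discogs-effnet-1.pb",
    "valence-discogs-effnet-1.pb",
]

def missing_models(model_dir: str = "/app/models") -> list[str]:
    """Return the expected model files that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_MODELS if not (root / name).is_file()]
```

Logging the result once at container startup makes a partial Docker build obvious instead of a silent fallback to estimated features.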
### Phase 2: Implement Enhanced Analysis (Days 2-4)
#### 2.1 Rewrite analyzer.py with ML Models
```python
import os
import logging
from typing import Any, Dict

import numpy as np

logger = logging.getLogger(__name__)


class AudioAnalyzer:
    """Enhanced audio analysis using Essentia TensorFlow models"""

    def __init__(self):
        self.models_loaded = False
        self.embedding_model = None
        self.mood_models = {}
        if ESSENTIA_AVAILABLE:
            self._init_essentia()
            self._load_ml_models()

    def _load_ml_models(self):
        """Load TensorFlow models for enhanced analysis"""
        try:
            from essentia.standard import (
                TensorflowPredictEffnetDiscogs,
                TensorflowPredict2D,
            )

            # Load embedding extractor (base for all predictions)
            embedding_path = '/app/models/discogs-effnet-bs64-1.pb'
            if os.path.exists(embedding_path):
                self.embedding_model = TensorflowPredictEffnetDiscogs(
                    graphFilename=embedding_path,
                    output="PartitionedCall:1",
                )
                logger.info("Loaded embedding model")

            # Load mood prediction models
            mood_model_paths = {
                'happy': '/app/models/mood_happy-discogs-effnet-1.pb',
                'sad': '/app/models/mood_sad-discogs-effnet-1.pb',
                'relaxed': '/app/models/mood_relaxed-discogs-effnet-1.pb',
                'aggressive': '/app/models/mood_aggressive-discogs-effnet-1.pb',
                'danceability': '/app/models/danceability-discogs-effnet-1.pb',
                'voice_instrumental': '/app/models/voice_instrumental-discogs-effnet-1.pb',
                'arousal': '/app/models/arousal-discogs-effnet-1.pb',
                'valence': '/app/models/valence-discogs-effnet-1.pb',
            }
            for name, path in mood_model_paths.items():
                if os.path.exists(path):
                    self.mood_models[name] = TensorflowPredict2D(
                        graphFilename=path,
                        output="model/Softmax",
                    )
                    logger.info(f"Loaded {name} model")

            # Enhanced mode needs the embedding model plus at least one head
            self.models_loaded = (
                self.embedding_model is not None and len(self.mood_models) > 0
            )
            logger.info(f"ML models loaded: {self.models_loaded} ({len(self.mood_models)} models)")
        except Exception as e:
            logger.warning(f"Could not load ML models: {e}")
            self.models_loaded = False

    def analyze(self, file_path: str) -> Dict[str, Any]:
        """Full analysis with ML models if available"""
        result = self._extract_basic_features(file_path)
        if self.models_loaded:
            ml_features = self._extract_ml_features(file_path)
            result.update(ml_features)
            result['analysisMode'] = 'enhanced'
        else:
            # Fall back to estimated values
            result.update(self._estimate_mood_features(result))
            result['analysisMode'] = 'standard'
        return result

    def _extract_ml_features(self, file_path: str) -> Dict[str, Any]:
        """Extract features using TensorFlow models"""
        result = {}
        # Load audio at 16kHz for the ML models
        audio = self.load_audio(file_path, sample_rate=16000)
        if audio is None:
            return result

        # Get per-patch embeddings from the base model
        embeddings = self.embedding_model(audio)

        # Mood predictions (column 1 = probability of the positive class)
        if 'happy' in self.mood_models:
            preds = self.mood_models['happy'](embeddings)
            result['moodHappy'] = float(np.mean(preds[:, 1]))
        if 'sad' in self.mood_models:
            preds = self.mood_models['sad'](embeddings)
            result['moodSad'] = float(np.mean(preds[:, 1]))
        if 'relaxed' in self.mood_models:
            preds = self.mood_models['relaxed'](embeddings)
            result['moodRelaxed'] = float(np.mean(preds[:, 1]))
        if 'aggressive' in self.mood_models:
            preds = self.mood_models['aggressive'](embeddings)
            result['moodAggressive'] = float(np.mean(preds[:, 1]))

        # Real valence and arousal from dedicated models
        if 'valence' in self.mood_models:
            preds = self.mood_models['valence'](embeddings)
            result['valence'] = float(np.mean(preds[:, 1]))
        if 'arousal' in self.mood_models:
            preds = self.mood_models['arousal'](embeddings)
            result['arousal'] = float(np.mean(preds[:, 1]))

        # Instrumentalness from the voice/instrumental model
        if 'voice_instrumental' in self.mood_models:
            preds = self.mood_models['voice_instrumental'](embeddings)
            result['instrumentalness'] = float(np.mean(preds[:, 1]))  # 1 = instrumental

        # ML-based danceability
        if 'danceability' in self.mood_models:
            preds = self.mood_models['danceability'](embeddings)
            result['danceabilityMl'] = float(np.mean(preds[:, 1]))
        return result
```
### Phase 3: Update Database Schema (Day 3)
#### 3.1 Add New Feature Columns
```prisma
model Track {
  // ... existing fields ...

  // ML-based mood predictions (Enhanced mode)
  moodHappy      Float?  // ML prediction 0-1
  moodSad        Float?  // ML prediction 0-1
  moodRelaxed    Float?  // ML prediction 0-1
  moodAggressive Float?  // ML prediction 0-1
  danceabilityMl Float?  // ML-based danceability

  // Analysis metadata
  analysisMode   String? // 'standard' or 'enhanced'
}
```
### Phase 4: Update Vibe Matching Algorithm (Day 4)
#### 4.1 Use Real Mood Predictions in Matching
```typescript
// In library.ts - Enhanced vibe matching
const scored = analyzedTracks.map(t => {
  let score = 0;
  let factors = 0;

  // === MOOD MATCHING (50% total - the heart of vibe) ===
  // Happy mood (15%)
  if (sourceTrack.moodHappy !== null && t.moodHappy !== null) {
    score += (1 - Math.abs(sourceTrack.moodHappy - t.moodHappy)) * 0.15;
    factors += 0.15;
  }
  // Sad mood (10%)
  if (sourceTrack.moodSad !== null && t.moodSad !== null) {
    score += (1 - Math.abs(sourceTrack.moodSad - t.moodSad)) * 0.10;
    factors += 0.10;
  }
  // Relaxed mood (10%)
  if (sourceTrack.moodRelaxed !== null && t.moodRelaxed !== null) {
    score += (1 - Math.abs(sourceTrack.moodRelaxed - t.moodRelaxed)) * 0.10;
    factors += 0.10;
  }
  // Aggressive mood (10%)
  if (sourceTrack.moodAggressive !== null && t.moodAggressive !== null) {
    score += (1 - Math.abs(sourceTrack.moodAggressive - t.moodAggressive)) * 0.10;
    factors += 0.10;
  }
  // Valence - overall positivity (5%)
  if (sourceTrack.valence !== null && t.valence !== null) {
    score += (1 - Math.abs(sourceTrack.valence - t.valence)) * 0.05;
    factors += 0.05;
  }

  // === AUDIO CHARACTERISTICS (35% total) ===
  // BPM (15%) - linear falloff: full credit at 0 difference, none at ±30 BPM
  if (sourceTrack.bpm && t.bpm) {
    const bpmDiff = Math.abs(sourceTrack.bpm - t.bpm);
    score += Math.max(0, 1 - bpmDiff / 30) * 0.15;
    factors += 0.15;
  }
  // Energy (10%)
  if (sourceTrack.energy !== null && t.energy !== null) {
    score += (1 - Math.abs(sourceTrack.energy - t.energy)) * 0.10;
    factors += 0.10;
  }
  // Danceability - prefer the ML version when available (10%)
  const srcDance = sourceTrack.danceabilityMl ?? sourceTrack.danceability;
  const tDance = t.danceabilityMl ?? t.danceability;
  if (srcDance !== null && tDance !== null) {
    score += (1 - Math.abs(srcDance - tDance)) * 0.10;
    factors += 0.10;
  }

  // === GENRE/TAGS (15% total) ===
  // Genre/tag overlap (10%)
  const sourceGenres = [...(sourceTrack.lastfmTags || []), ...(sourceTrack.essentiaGenres || [])];
  const trackGenres = [...(t.lastfmTags || []), ...(t.essentiaGenres || [])];
  if (sourceGenres.length > 0 && trackGenres.length > 0) {
    const overlap = sourceGenres.filter(g => trackGenres.includes(g)).length;
    const maxOverlap = Math.max(sourceGenres.length, trackGenres.length);
    score += (overlap / maxOverlap) * 0.10;
    factors += 0.10;
  }
  // Key compatibility (5%) - same scale scores full, different scale half
  if (sourceTrack.keyScale && t.keyScale) {
    score += (sourceTrack.keyScale === t.keyScale ? 1 : 0.5) * 0.05;
    factors += 0.05;
  }

  // Normalize by the weights actually used so missing features don't penalize
  const finalScore = factors > 0 ? score / factors : 0;
  return { id: t.id, score: finalScore };
});
```
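The `score / factors` normalization is what keeps scores comparable when some features are missing from one of the tracks. A Python sketch of just that mechanism (field names mirror the TypeScript; the helper itself is illustrative):

```python
def weighted_similarity(src: dict, cand: dict, weights: dict) -> float:
    """Weighted mean of per-feature similarities, skipping missing features."""
    score = factors = 0.0
    for field, weight in weights.items():
        a, b = src.get(field), cand.get(field)
        if a is not None and b is not None:
            score += (1 - abs(a - b)) * weight
            factors += weight
    # Divide by the weight actually used, not the nominal total
    return score / factors if factors else 0.0

weights = {"moodHappy": 0.15, "moodSad": 0.10, "energy": 0.10}
full = weighted_similarity({"moodHappy": 0.9, "moodSad": 0.1, "energy": 0.8},
                           {"moodHappy": 0.8, "moodSad": 0.2, "energy": 0.8}, weights)
partial = weighted_similarity({"moodHappy": 0.9}, {"moodHappy": 0.8}, weights)
print(round(full, 3), round(partial, 3))  # 0.929 0.9 - unanalyzed fields don't drag the score down
```

The trade-off: a track analyzed in Standard mode can still match one analyzed in Enhanced mode, but its score rests on fewer signals.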
### Phase 5: Create Standard Mode Fallback (Day 5)
After Enhanced mode is working, implement Standard mode:
- Same algorithm structure but skip ML features
- Use estimated valence (improved heuristics)
- Lower weights on mood matching since it's estimated
- Higher weights on BPM, energy, genre tags
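The "improved heuristics" for Standard mode could fold in more than key and energy. One possible shape, purely illustrative with untuned placeholder weights:

```python
def estimate_valence(scale: str, energy: float, bpm: float,
                     spectral_centroid_hz: float) -> float:
    """Heuristic valence from mode, energy, tempo, and brightness.
    Weights are illustrative placeholders, not tuned values."""
    mode = 0.6 if scale == "major" else 0.4
    tempo = min(max((bpm - 60) / 120, 0.0), 1.0)        # map 60-180 BPM to 0-1
    brightness = min(spectral_centroid_hz / 4000, 1.0)  # crude 0-1 cap
    v = 0.25 * mode + 0.35 * energy + 0.20 * tempo + 0.20 * brightness
    return round(min(max(v, 0.0), 1.0), 3)
```

This stays cheap (all inputs are already extracted in Standard mode) while spreading the estimate over four signals instead of two; it still cannot distinguish "Fake Happy" from genuinely happy tracks, which is why Enhanced mode remains the default.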
### Phase 6: Settings & UI (Day 6)
#### 6.1 Add Settings Toggle
```typescript
// System settings - Enhanced is the DEFAULT
interface AudioAnalysisSettings {
  vibeMatchingMode: 'enhanced' | 'standard'; // Default: 'enhanced'
  reanalyzeOnModeChange: boolean;            // Default: false
}
```
#### 6.2 Settings UI
```
Audio Analysis
├── Vibe Matching Mode
│ ├── ● Enhanced (Recommended - Default)
│ │ └── Uses ML models for accurate mood detection
│ └── ○ Standard (Power Saver)
│ └── Faster, uses basic audio features only
├── Analysis Status
│ └── "1,234 / 1,500 tracks analyzed (Enhanced mode)"
└── [Re-analyze Library] button
└── "Re-analyze all tracks with current settings"
```
### Phase 7: Testing & Validation (Day 7)
#### 7.1 Test Cases
| Source Track | Bad Match (Current) | Expected Good Match |
|--------------|---------------------|---------------------|
| "Fake Happy" (Paramore) | "Summer Girl" (Jamiroquai) 97% | Other emo/pop-punk; "Summer Girl" drops below 60% |
| "Creep" (Radiohead) | Fast dance track | Other melancholic rock |
| "Uptown Funk" | Slow ballad | Other high-energy funk/pop |
#### 7.2 Performance Testing
- Analyze 100 tracks, measure time
- Memory usage during analysis
- Queue handling under load
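The timing run in 7.2 can be a small harness around the analyzer. The `analyze` callable is hypothetical here, and `tracemalloc` only sees Python-side allocations, not TensorFlow's native memory:

```python
import time
import tracemalloc
from statistics import mean, median

def benchmark(analyze, files: list) -> dict:
    """Time analyze() over a list of files and report per-track stats."""
    durations = []
    tracemalloc.start()
    for path in files:
        start = time.perf_counter()
        analyze(path)
        durations.append(time.perf_counter() - start)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return {
        "tracks": len(files),
        "mean_s": mean(durations),
        "median_s": median(durations),
        "peak_py_mem_mb": peak / 1e6,
    }
```

Running this over ~100 representative tracks in each mode gives the per-track numbers to compare against the estimates in the benchmarks table below; native memory would need `docker stats` or similar.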
---
## Performance Benchmarks (Estimated)
| Operation | Standard Mode | Enhanced Mode |
|-----------|---------------|---------------|
| Analysis per track | 1-2 sec | 5-10 sec |
| RAM usage | ~100MB | ~500MB |
| Models in Docker | N/A | ~200MB (pre-packaged) |
| Vibe match query | <100ms | <100ms |
| Full library (1000 tracks) | ~30 min | ~2-3 hours |
---
## Files to Modify
| File | Changes |
|------|---------|
| `services/audio-analyzer/Dockerfile` | Add model downloads during build |
| `services/audio-analyzer/analyzer.py` | Implement ML model loading and prediction |
| `backend/prisma/schema.prisma` | Add mood prediction columns |
| `backend/src/routes/library.ts` | Update vibe matching algorithm weights |
| `frontend/features/settings/` | Add analysis mode toggle (default: enhanced) |
| `frontend/components/player/VibeGraph.tsx` | Display mood predictions |
---
## Success Metrics
After implementation, "Fake Happy" and "Summer Girl" should:
- Match at **<50%** (different emotional content, different genre)
Better matches for "Fake Happy" would be:
- Other Paramore songs (same artist = genre/production match)
- Emo/pop-punk with similar emotional complexity
- Songs with high energy but mixed emotional signals
---
## Implementation Order (Enhanced First)
### Week 1: Get Enhanced Mode Working
1. [x] Create implementation plan (this document)
2. [x] Update Dockerfile to pre-package ML models (~200MB)
3. [x] Rewrite analyzer.py with TensorFlow model loading
4. [x] Add new database columns for mood predictions (moodHappy, moodSad, etc.)
5. [x] Update vibe matching algorithm with ML mood weights
6. [x] Update programmatic playlists to use ML mood predictions
7. [ ] Run Prisma migration to apply schema changes
8. [ ] Rebuild audio-analyzer Docker container
9. [ ] Test ML analysis on sample tracks
### Week 2: Polish & Fallback
10. [ ] Test accuracy with diverse track pairs
11. [ ] Add settings UI (Enhanced = default)
12. [ ] Implement Standard mode as explicit fallback option
13. [ ] Update VibeGraph to show mood predictions
14. [ ] Documentation and testing
---
## Quick Reference: Models to Include
| Model | File | Purpose | Size |
|-------|------|---------|------|
| Embeddings | `discogs-effnet-bs64-1.pb` | Base model for all predictions | ~85MB |
| Happy | `mood_happy-discogs-effnet-1.pb` | Happiness detection | ~15MB |
| Sad | `mood_sad-discogs-effnet-1.pb` | Sadness detection | ~15MB |
| Relaxed | `mood_relaxed-discogs-effnet-1.pb` | Relaxation detection | ~15MB |
| Aggressive | `mood_aggressive-discogs-effnet-1.pb` | Aggression detection | ~15MB |
| Arousal | `mood_arousal-discogs-effnet-1.pb` | Energy/calm scale | ~15MB |
| Valence | `mood_valence-discogs-effnet-1.pb` | Positive/negative | ~15MB |
| Danceability | `danceability-discogs-effnet-1.pb` | ML danceability | ~15MB |
| Voice/Instrumental | `voice_instrumental-discogs-effnet-1.pb` | Vocal detection | ~15MB |
**Total:** ~200MB (one-time addition to Docker image)