# Vibe Matching Implementation Plan

## Executive Summary

The current vibe matching system uses Essentia for audio analysis but only extracts **basic features**. Critical mood/emotion features are either placeholder values or poorly estimated. This document outlines a comprehensive plan to achieve Spotify-quality vibe matching while being conscious of performance on user hardware.

## Strategy Update (Latest)

**Default:** Enhanced mode (ML-powered, accurate)

**Fallback:** Standard mode (lightweight, for troubleshooting or power saving)

**Approach:**

1. ✅ Pre-package all Essentia TensorFlow models in the Docker image (~200MB)
2. 🔄 Fix Enhanced mode FIRST - make it actually use the ML models
3. ⏳ THEN create Standard mode as a lightweight fallback
4. Users can toggle to Standard mode to save CPU if needed

---

## Current State Analysis

### What Essentia IS Currently Extracting (Working)

| Feature | Status | Quality |
|---------|--------|---------|
| **BPM** | ✅ Working | Good - Uses `RhythmExtractor2013` |
| **Key** | ✅ Working | Good - Uses `KeyExtractor` |
| **KeyScale** | ✅ Working | Good - major/minor detection |
| **Energy** | ✅ Working | Moderate - Raw energy, normalized |
| **Loudness** | ✅ Working | Good - dB measurement |
| **Dynamic Range** | ✅ Working | Good |
| **Danceability** | ✅ Working | Good - Uses `Danceability` algorithm |
| **Beats Count** | ✅ Working | Good |
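
For reference, a minimal sketch of how these working features come out of Essentia's standard algorithms (the file path and loader settings are illustrative, not the exact code in `analyzer.py`):

```python
import essentia.standard as es

audio = es.MonoLoader(filename='/music/sample.flac', sampleRate=44100)()  # hypothetical path

# BPM and beat positions
bpm, beats, beats_confidence, _, _ = es.RhythmExtractor2013(method='multifeature')(audio)

# Key and scale (major/minor)
key, scale, key_strength = es.KeyExtractor()(audio)

print(f"{bpm:.1f} BPM, {len(beats)} beats, key {key} {scale}")
```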

### What's Broken or Placeholder

| Feature | Status | Problem |
|---------|--------|---------|
| **Valence** | ⚠️ Fake | Calculated as `(major/minor * 0.4) + (energy * 0.6)` - NOT actual emotional valence |
| **Arousal** | ⚠️ Fake | Calculated as `(BPM * 0.5) + (energy * 0.5)` - NOT actual arousal |
| **Instrumentalness** | ❌ Placeholder | Hardcoded to `0.5` |
| **Acousticness** | ⚠️ Estimate | Rough estimate from dynamic range |
| **Speechiness** | ❌ Placeholder | Hardcoded to `0.1` |
| **Mood Tags** | ⚠️ Derived | Generated from fake valence/arousal, not ML |
| **Genre Tags** | ❌ Empty | TensorFlow models not loaded |

### The Core Issue

```python
# Current valence calculation (analyzer.py lines 226-231)
key_valence = 0.6 if scale == 'major' else 0.4
energy_valence = result['energy']
result['valence'] = round((key_valence * 0.4 + energy_valence * 0.6), 3)
```

**"Fake Happy" by Paramore** (emotionally complex, about masking sadness):
- Major key → 0.6
- High energy → ~0.7
- Calculated valence: `(0.6 * 0.4) + (0.7 * 0.6) = 0.66` (appears "happy")

**"Summer Girl" by Jamiroquai** (genuinely upbeat funk):
- Major key → 0.6
- High energy → ~0.7
- Calculated valence: `(0.6 * 0.4) + (0.7 * 0.6) = 0.66` (appears "happy")

**Result: a 97% match despite completely different vibes!**
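
The failure is structural, not a tuning problem: the heuristic has no input that could tell these tracks apart. A minimal reproduction:

```python
def fake_valence(scale: str, energy: float) -> float:
    """The current heuristic: blind to everything except key and energy."""
    key_valence = 0.6 if scale == 'major' else 0.4
    return round(key_valence * 0.4 + energy * 0.6, 3)

print(fake_valence('major', 0.7))  # "Fake Happy" (Paramore)    -> 0.66
print(fake_valence('major', 0.7))  # "Summer Girl" (Jamiroquai) -> 0.66
```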

---

## How Spotify Does It

Spotify's audio analysis uses a combination of:

### 1. Low-Level Audio Features (Similar to what we have)
- Tempo/BPM
- Key/Mode
- Loudness
- Time signature

### 2. Mid-Level Features (We're missing these)
- **Spectral Centroid** - "brightness" of the sound
- **Spectral Rolloff** - frequency distribution
- **Zero Crossing Rate** - percussiveness
- **MFCCs** - Mel-frequency cepstral coefficients (timbral texture)
- **Chroma Features** - harmonic content
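
A sketch of how these mid-level features could be computed with Essentia's standard algorithms (frame/hop sizes here are illustrative assumptions, not tuned values):

```python
import numpy as np
import essentia.standard as es

def midlevel_features(path: str) -> dict:
    audio = es.MonoLoader(filename=path, sampleRate=44100)()
    window, spectrum = es.Windowing(type='hann'), es.Spectrum()
    mfcc = es.MFCC(numberCoefficients=13)  # timbral texture
    centroid = es.Centroid(range=22050)    # spectral "brightness" in Hz
    zcr = es.ZeroCrossingRate()            # percussiveness proxy

    centroids, mfccs, zcrs = [], [], []
    for frame in es.FrameGenerator(audio, frameSize=2048, hopSize=1024):
        spec = spectrum(window(frame))
        _, coeffs = mfcc(spec)
        mfccs.append(coeffs)
        centroids.append(centroid(spec))
        zcrs.append(zcr(frame))

    return {
        'spectralCentroid': float(np.mean(centroids)),
        'mfccMeans': np.mean(mfccs, axis=0).tolist(),
        'zeroCrossingRate': float(np.mean(zcrs)),
    }
```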

### 3. High-Level Features (We're faking these)
- **Valence** - Musical positiveness (0-1)
- **Arousal/Energy** - Intensity and activity
- **Instrumentalness** - Vocal presence prediction
- **Acousticness** - Acoustic vs electronic
- **Speechiness** - Presence of spoken words
- **Liveness** - Audience presence detection

### 4. Deep Learning Models
Spotify trains neural networks on millions of labeled tracks to predict:
- Mood categories
- Genre classification
- User preference patterns

---

## Two-Tier System

### Default: Enhanced Vibe Matching (ML-Powered)

**Status:** DEFAULT - Pre-packaged in Docker, just works

**Target:** High accuracy, ~5-10 seconds per track

**Features (from Essentia TensorFlow Models):**

1. **Mood Predictions (real ML, not estimated):**
   - `mood_happy-discogs-effnet-1.pb` - Happiness/positivity 0-1
   - `mood_sad-discogs-effnet-1.pb` - Sadness 0-1
   - `mood_relaxed-discogs-effnet-1.pb` - Relaxation/calmness 0-1
   - `mood_aggressive-discogs-effnet-1.pb` - Aggression/intensity 0-1

2. **Audio Characteristics:**
   - `danceability-discogs-effnet-1.pb` - ML-based danceability
   - `voice_instrumental-discogs-effnet-1.pb` - Vocal detection (instrumentalness)

3. **Embeddings for Similarity:**
   - `discogs-effnet-bs64-1.pb` - Audio embeddings (neural "fingerprint")
   - Can be used for direct similarity comparison (see the sketch at the end of this subsection)

4. **Spectral Features:**
   - Spectral Centroid (brightness)
   - MFCCs (timbral texture - 13 coefficients)

**Models Pre-packaged:** ~200MB in Docker image (no user download)

**RAM Requirement:** ~500MB during analysis

**CPU Requirement:** Any modern CPU (2015+)
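
The embeddings turn similarity into a vector comparison rather than a weighted checklist. A minimal sketch, assuming each track's per-patch embeddings (as produced by the Phase 2 analyzer) are averaged into a single vector:

```python
import numpy as np

def embedding_similarity(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Cosine similarity between two track embeddings (1.0 = identical fingerprint)."""
    a, b = emb_a.mean(axis=0), emb_b.mean(axis=0)  # patch embeddings -> one track vector
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
```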

### Fallback: Standard Vibe Matching (Lightweight)

**Status:** FALLBACK - For troubleshooting or power saving

**Target:** Fast, <2 seconds per track, low CPU

**Features Used:**
- BPM (Essentia RhythmExtractor)
- Energy (Essentia Energy)
- Danceability (Essentia Danceability - non-ML version)
- Key/Scale (Essentia KeyExtractor)
- Spectral Centroid (cheap to compute)
- Last.fm mood tags
- Genre matching from tags

**When to use Standard mode:**
- Low-power devices (Raspberry Pi, older NAS)
- Troubleshooting if Enhanced mode has issues
- User preference to save CPU cycles

---

## Implementation Plan

### Phase 1: Pre-Package Models in Docker (Day 1)

#### 1.1 Update Dockerfile to Include Models

```dockerfile
# Download Essentia ML models during build (~200MB)
RUN apt-get update && apt-get install -y --no-install-recommends curl && \
    # Ensure the target directory exists before downloading
    mkdir -p /app/models && \
    # Base embedding model (required for all predictions)
    curl -L -o /app/models/discogs-effnet-bs64-1.pb \
      "https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb" && \
    # Mood models
    curl -L -o /app/models/mood_happy-discogs-effnet-1.pb \
      "https://essentia.upf.edu/models/classification-heads/mood_happy/mood_happy-discogs-effnet-1.pb" && \
    curl -L -o /app/models/mood_sad-discogs-effnet-1.pb \
      "https://essentia.upf.edu/models/classification-heads/mood_sad/mood_sad-discogs-effnet-1.pb" && \
    curl -L -o /app/models/mood_relaxed-discogs-effnet-1.pb \
      "https://essentia.upf.edu/models/classification-heads/mood_relaxed/mood_relaxed-discogs-effnet-1.pb" && \
    curl -L -o /app/models/mood_aggressive-discogs-effnet-1.pb \
      "https://essentia.upf.edu/models/classification-heads/mood_aggressive/mood_aggressive-discogs-effnet-1.pb" && \
    # Danceability and voice/instrumental
    curl -L -o /app/models/danceability-discogs-effnet-1.pb \
      "https://essentia.upf.edu/models/classification-heads/danceability/danceability-discogs-effnet-1.pb" && \
    curl -L -o /app/models/voice_instrumental-discogs-effnet-1.pb \
      "https://essentia.upf.edu/models/classification-heads/voice_instrumental/voice_instrumental-discogs-effnet-1.pb" && \
    # Arousal/Valence models
    curl -L -o /app/models/arousal-discogs-effnet-1.pb \
      "https://essentia.upf.edu/models/classification-heads/mood_arousal/mood_arousal-discogs-effnet-1.pb" && \
    curl -L -o /app/models/valence-discogs-effnet-1.pb \
      "https://essentia.upf.edu/models/classification-heads/mood_valence/mood_valence-discogs-effnet-1.pb" && \
    apt-get purge -y curl && rm -rf /var/lib/apt/lists/*
```

### Phase 2: Implement Enhanced Analysis (Days 2-4)

#### 2.1 Rewrite analyzer.py with ML Models

```python
import logging
import os
from typing import Any, Dict

import numpy as np

logger = logging.getLogger(__name__)

# Essentia may be unavailable outside the analyzer container
try:
    import essentia.standard  # noqa: F401
    ESSENTIA_AVAILABLE = True
except ImportError:
    ESSENTIA_AVAILABLE = False


class AudioAnalyzer:
    """Enhanced audio analysis using Essentia TensorFlow models"""

    def __init__(self):
        self.models_loaded = False
        self.embedding_model = None
        self.mood_models = {}

        if ESSENTIA_AVAILABLE:
            self._init_essentia()
            self._load_ml_models()

    def _load_ml_models(self):
        """Load TensorFlow models for enhanced analysis"""
        try:
            from essentia.standard import (
                TensorflowPredictEffnetDiscogs,
                TensorflowPredict2D
            )

            # Load embedding extractor (base for all predictions)
            embedding_path = '/app/models/discogs-effnet-bs64-1.pb'
            if os.path.exists(embedding_path):
                self.embedding_model = TensorflowPredictEffnetDiscogs(
                    graphFilename=embedding_path,
                    output="PartitionedCall:1"
                )
                logger.info("Loaded embedding model")

            # Load mood prediction models
            mood_models = {
                'happy': '/app/models/mood_happy-discogs-effnet-1.pb',
                'sad': '/app/models/mood_sad-discogs-effnet-1.pb',
                'relaxed': '/app/models/mood_relaxed-discogs-effnet-1.pb',
                'aggressive': '/app/models/mood_aggressive-discogs-effnet-1.pb',
                'danceability': '/app/models/danceability-discogs-effnet-1.pb',
                'voice_instrumental': '/app/models/voice_instrumental-discogs-effnet-1.pb',
                'arousal': '/app/models/arousal-discogs-effnet-1.pb',
                'valence': '/app/models/valence-discogs-effnet-1.pb',
            }

            for name, path in mood_models.items():
                if os.path.exists(path):
                    self.mood_models[name] = TensorflowPredict2D(
                        graphFilename=path,
                        output="model/Softmax"
                    )
                    logger.info(f"Loaded {name} model")

            self.models_loaded = len(self.mood_models) > 0
            logger.info(f"ML models loaded: {self.models_loaded} ({len(self.mood_models)} models)")

        except Exception as e:
            logger.warning(f"Could not load ML models: {e}")
            self.models_loaded = False

    def analyze(self, file_path: str) -> Dict[str, Any]:
        """Full analysis with ML models if available"""
        # Existing BPM/key/energy extraction stays as-is
        result = self._extract_basic_features(file_path)

        if self.models_loaded:
            ml_features = self._extract_ml_features(file_path)
            result.update(ml_features)
            result['analysisMode'] = 'enhanced'
        else:
            # Fallback to estimated values
            result.update(self._estimate_mood_features(result))
            result['analysisMode'] = 'standard'

        return result

    def _extract_ml_features(self, file_path: str) -> Dict[str, Any]:
        """Extract features using TensorFlow models"""
        result = {}

        # Load audio at 16kHz for ML models
        audio = self.load_audio(file_path, sample_rate=16000)
        if audio is None:
            return result

        # Get embeddings (one vector per audio patch)
        embeddings = self.embedding_model(audio)

        # Mood predictions
        if 'happy' in self.mood_models:
            preds = self.mood_models['happy'](embeddings)
            result['moodHappy'] = float(np.mean(preds[:, 1]))  # Probability of "happy"

        if 'sad' in self.mood_models:
            preds = self.mood_models['sad'](embeddings)
            result['moodSad'] = float(np.mean(preds[:, 1]))

        if 'relaxed' in self.mood_models:
            preds = self.mood_models['relaxed'](embeddings)
            result['moodRelaxed'] = float(np.mean(preds[:, 1]))

        if 'aggressive' in self.mood_models:
            preds = self.mood_models['aggressive'](embeddings)
            result['moodAggressive'] = float(np.mean(preds[:, 1]))

        # Real valence and arousal from dedicated models
        if 'valence' in self.mood_models:
            preds = self.mood_models['valence'](embeddings)
            result['valence'] = float(np.mean(preds[:, 1]))

        if 'arousal' in self.mood_models:
            preds = self.mood_models['arousal'](embeddings)
            result['arousal'] = float(np.mean(preds[:, 1]))

        # Instrumentalness from voice/instrumental model
        if 'voice_instrumental' in self.mood_models:
            preds = self.mood_models['voice_instrumental'](embeddings)
            result['instrumentalness'] = float(np.mean(preds[:, 1]))  # 1 = instrumental

        # ML-based danceability
        if 'danceability' in self.mood_models:
            preds = self.mood_models['danceability'](embeddings)
            result['danceabilityMl'] = float(np.mean(preds[:, 1]))

        return result
```

Note: `_init_essentia`, `load_audio`, `_extract_basic_features`, and `_estimate_mood_features` are the existing helpers in `analyzer.py`; they are unchanged and omitted here.
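
A usage sketch (the path is hypothetical; `analysisMode` in the result shows whether the ML models were actually used):

```python
analyzer = AudioAnalyzer()
features = analyzer.analyze('/music/sample.flac')  # hypothetical path
print(features['analysisMode'], features.get('valence'), features.get('moodHappy'))
```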

### Phase 3: Update Database Schema (Day 3)

#### 3.1 Add New Feature Columns

```prisma
model Track {
  // ... existing fields ...

  // ML-based mood predictions (Enhanced mode)
  moodHappy      Float?  // ML prediction 0-1
  moodSad        Float?  // ML prediction 0-1
  moodRelaxed    Float?  // ML prediction 0-1
  moodAggressive Float?  // ML prediction 0-1
  danceabilityMl Float?  // ML-based danceability

  // Analysis metadata
  analysisMode   String? // 'standard' or 'enhanced'
}
```

### Phase 4: Update Vibe Matching Algorithm (Day 4)

#### 4.1 Use Real Mood Predictions in Matching

```typescript
// In library.ts - Enhanced vibe matching
const scored = analyzedTracks.map(t => {
  let score = 0;
  let factors = 0;

  // === MOOD MATCHING (50% total - the heart of vibe) ===

  // Happy mood (15%)
  if (sourceTrack.moodHappy !== null && t.moodHappy !== null) {
    score += (1 - Math.abs(sourceTrack.moodHappy - t.moodHappy)) * 0.15;
    factors += 0.15;
  }

  // Sad mood (10%)
  if (sourceTrack.moodSad !== null && t.moodSad !== null) {
    score += (1 - Math.abs(sourceTrack.moodSad - t.moodSad)) * 0.10;
    factors += 0.10;
  }

  // Relaxed mood (10%)
  if (sourceTrack.moodRelaxed !== null && t.moodRelaxed !== null) {
    score += (1 - Math.abs(sourceTrack.moodRelaxed - t.moodRelaxed)) * 0.10;
    factors += 0.10;
  }

  // Aggressive mood (10%)
  if (sourceTrack.moodAggressive !== null && t.moodAggressive !== null) {
    score += (1 - Math.abs(sourceTrack.moodAggressive - t.moodAggressive)) * 0.10;
    factors += 0.10;
  }

  // Valence - overall positivity (5%)
  if (sourceTrack.valence !== null && t.valence !== null) {
    score += (1 - Math.abs(sourceTrack.valence - t.valence)) * 0.05;
    factors += 0.05;
  }

  // === AUDIO CHARACTERISTICS (35% total) ===

  // BPM (15%) - linear falloff, zero credit at a ±30 BPM difference
  if (sourceTrack.bpm && t.bpm) {
    const bpmDiff = Math.abs(sourceTrack.bpm - t.bpm);
    score += Math.max(0, 1 - bpmDiff / 30) * 0.15;
    factors += 0.15;
  }

  // Energy (10%)
  if (sourceTrack.energy !== null && t.energy !== null) {
    score += (1 - Math.abs(sourceTrack.energy - t.energy)) * 0.10;
    factors += 0.10;
  }

  // Danceability - prefer ML version (10%)
  const srcDance = sourceTrack.danceabilityMl ?? sourceTrack.danceability;
  const tDance = t.danceabilityMl ?? t.danceability;
  if (srcDance !== null && tDance !== null) {
    score += (1 - Math.abs(srcDance - tDance)) * 0.10;
    factors += 0.10;
  }

  // === GENRE/TAGS (15% total) ===

  // Genre/tag overlap (10%)
  const sourceGenres = [...(sourceTrack.lastfmTags || []), ...(sourceTrack.essentiaGenres || [])];
  const trackGenres = [...(t.lastfmTags || []), ...(t.essentiaGenres || [])];
  if (sourceGenres.length > 0 && trackGenres.length > 0) {
    const overlap = sourceGenres.filter(g => trackGenres.includes(g)).length;
    const maxOverlap = Math.max(sourceGenres.length, trackGenres.length);
    score += (overlap / maxOverlap) * 0.10;
    factors += 0.10;
  }

  // Key compatibility (5%)
  if (sourceTrack.keyScale && t.keyScale) {
    score += (sourceTrack.keyScale === t.keyScale ? 1 : 0.5) * 0.05;
    factors += 0.05;
  }

  const finalScore = factors > 0 ? score / factors : 0;
  return { id: t.id, score: finalScore };
});
```

Dividing the accumulated `score` by `factors` normalizes the result to 0-1 over only the features both tracks actually have, so a track isn't penalized just because some of its fields haven't been analyzed yet.

### Phase 5: Create Standard Mode Fallback (Day 5)

After Enhanced mode is working, implement Standard mode:
- Same algorithm structure, but skip ML features
- Use estimated valence (improved heuristics; a sketch follows this list)
- Lower weights on mood matching since it's estimated
- Higher weights on BPM, energy, genre tags
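
One possible improved heuristic, purely as a hypothetical sketch: blend key/scale and energy with spectral brightness, so Standard mode at least reacts to timbre. The 4 kHz cap and the weights are assumptions to be tuned, not decided values:

```python
def estimate_valence(scale: str, energy: float, spectral_centroid_hz: float) -> float:
    """Standard-mode valence estimate - still a heuristic, but less blind than key + energy alone."""
    key_component = 0.6 if scale == 'major' else 0.4
    brightness = min(spectral_centroid_hz / 4000.0, 1.0)  # assumed normalization cap
    return round(0.3 * key_component + 0.4 * energy + 0.3 * brightness, 3)
```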

### Phase 6: Settings & UI (Day 6)

#### 6.1 Add Settings Toggle

```typescript
// System settings - Enhanced is DEFAULT
interface SystemSettings {
  audioAnalysis: {
    vibeMatchingMode: 'enhanced' | 'standard'; // Default: 'enhanced'
    reanalyzeOnModeChange: boolean;            // Default: false
  };
}
```

#### 6.2 Settings UI

```
Audio Analysis
├── Vibe Matching Mode
│   ├── ● Enhanced (Recommended - Default)
│   │   └── Uses ML models for accurate mood detection
│   └── ○ Standard (Power Saver)
│       └── Faster, uses basic audio features only
│
├── Analysis Status
│   └── "1,234 / 1,500 tracks analyzed (Enhanced mode)"
│
└── [Re-analyze Library] button
    └── "Re-analyze all tracks with current settings"
```

### Phase 7: Testing & Validation (Day 7)

#### 7.1 Test Cases

| Source Track | Bad Match (Current) | Expected Good Match |
|--------------|---------------------|---------------------|
| "Fake Happy" (Paramore) | "Summer Girl" (Jamiroquai) 97% | Other emo/pop-punk <60% |
| "Creep" (Radiohead) | Fast dance track | Other melancholic rock |
| "Uptown Funk" | Slow ballad | Other high-energy funk/pop |

#### 7.2 Performance Testing
- Analyze 100 tracks, measure time (see the harness sketch below)
- Memory usage during analysis
- Queue handling under load
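
A hypothetical benchmark harness for 7.2 - a sketch that assumes the `AudioAnalyzer` class from Phase 2 is importable from `analyzer` and that sample tracks live under `/music/samples`:

```python
import glob
import time

from analyzer import AudioAnalyzer  # assumed module name

analyzer = AudioAnalyzer()
files = sorted(glob.glob('/music/samples/*.flac'))[:100]  # assumed sample location

start = time.monotonic()
for path in files:
    analyzer.analyze(path)
elapsed = time.monotonic() - start

print(f"{len(files)} tracks in {elapsed:.1f}s "
      f"({elapsed / max(len(files), 1):.2f}s per track)")
```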

---

## Performance Benchmarks (Estimated)

| Operation | Standard Mode | Enhanced Mode |
|-----------|---------------|---------------|
| Analysis per track | 1-2 sec | 5-10 sec |
| RAM usage | ~100MB | ~500MB |
| Models in Docker | N/A | ~200MB (pre-packaged) |
| Vibe match query | <100ms | <100ms |
| Full library (1000 tracks) | ~30 min | ~2-3 hours |

---

## Files to Modify

| File | Changes |
|------|---------|
| `services/audio-analyzer/Dockerfile` | Add model downloads during build |
| `services/audio-analyzer/analyzer.py` | Implement ML model loading and prediction |
| `backend/prisma/schema.prisma` | Add mood prediction columns |
| `backend/src/routes/library.ts` | Update vibe matching algorithm weights |
| `frontend/features/settings/` | Add analysis mode toggle (default: enhanced) |
| `frontend/components/player/VibeGraph.tsx` | Display mood predictions |

---

## Success Metrics

After implementation, "Fake Happy" and "Summer Girl" should:
- Match at **<50%** (different emotional content, different genre)

Better matches for "Fake Happy" would be:
- Other Paramore songs (same artist = genre/production match)
- Emo/pop-punk with similar emotional complexity
- Songs with high energy but mixed emotional signals

---

## Implementation Order (Enhanced First)

### Week 1: Get Enhanced Mode Working
1. [x] Create implementation plan (this document)
2. [x] Update Dockerfile to pre-package ML models (~200MB)
3. [x] Rewrite analyzer.py with TensorFlow model loading
4. [x] Add new database columns for mood predictions (moodHappy, moodSad, etc.)
5. [x] Update vibe matching algorithm with ML mood weights
6. [x] Update programmatic playlists to use ML mood predictions
7. [ ] Run Prisma migration to apply schema changes
8. [ ] Rebuild audio-analyzer Docker container
9. [ ] Test ML analysis on sample tracks

### Week 2: Polish & Fallback
10. [ ] Test accuracy with diverse track pairs
11. [ ] Add settings UI (Enhanced = default)
12. [ ] Implement Standard mode as explicit fallback option
13. [ ] Update VibeGraph to show mood predictions
14. [ ] Documentation and testing

---

## Quick Reference: Models to Include

| Model | File | Purpose | Size |
|-------|------|---------|------|
| Embeddings | `discogs-effnet-bs64-1.pb` | Base model for all predictions | ~85MB |
| Happy | `mood_happy-discogs-effnet-1.pb` | Happiness detection | ~15MB |
| Sad | `mood_sad-discogs-effnet-1.pb` | Sadness detection | ~15MB |
| Relaxed | `mood_relaxed-discogs-effnet-1.pb` | Relaxation detection | ~15MB |
| Aggressive | `mood_aggressive-discogs-effnet-1.pb` | Aggression detection | ~15MB |
| Arousal | `mood_arousal-discogs-effnet-1.pb` | Energy/calm scale | ~15MB |
| Valence | `mood_valence-discogs-effnet-1.pb` | Positive/negative | ~15MB |
| Danceability | `danceability-discogs-effnet-1.pb` | ML danceability | ~15MB |
| Voice/Instrumental | `voice_instrumental-discogs-effnet-1.pb` | Vocal detection | ~15MB |

Note: the Phase 1 Dockerfile saves the arousal and valence heads under the shorter local names `arousal-discogs-effnet-1.pb` and `valence-discogs-effnet-1.pb`, which is what `analyzer.py` loads.

**Total:** ~200MB (one-time addition to Docker image)