Files
lidify/docs/implementation-summaries/audio-analysis-standard-mode/ENHANCED_MODE.md
2025-12-25 18:58:06 -06:00

175 lines
6.5 KiB
Markdown

# Audio Analysis - Enhanced Mode (MusiCNN)
## Overview
Enhanced mode uses Essentia's TensorFlow integration with MusiCNN (Music Convolutional Neural Network) models to perform ML-based mood and audio classification. This provides significantly more accurate mood detection compared to the heuristic-based Standard mode.
## Architecture
```
┌─────────────────┐
│ Audio File │
│ (16kHz mono) │
└────────┬────────┘
┌────────▼────────┐
│ TensorflowPredict│
│ MusiCNN │
│ (Embeddings) │
└────────┬────────┘
┌──────────────┼──────────────┐
│ │ │
┌─────────▼─────┐ ┌──────▼─────┐ ┌──────▼─────┐
│ Mood Happy │ │ Mood Sad │ │ Danceability│
│ TensorFlow │ │ TensorFlow │ │ TensorFlow │
│ Predict2D │ │ Predict2D │ │ Predict2D │
└───────┬───────┘ └─────┬──────┘ └──────┬──────┘
│ │ │
└───────────────┼───────────────┘
┌───────▼───────┐
│ Derived Scores│
│ Valence/Arousal│
└───────────────┘
```
## Key Components
### 1. Base Model: MusiCNN
- **Model**: `msd-musicnn-1.pb` (~3MB)
- **Source**: [Essentia Model Zoo](https://essentia.upf.edu/models/autotagging/msd/)
- **Function**: Extracts 200-dimensional embeddings from audio
- **Algorithm**: `TensorflowPredictMusiCNN`
### 2. Classification Heads
Each classification head takes the MusiCNN embeddings and outputs probabilities:
| Model | File | Output |
|-------|------|--------|
| Mood Happy | `mood_happy-msd-musicnn-1.pb` | P(happy) |
| Mood Sad | `mood_sad-msd-musicnn-1.pb` | P(sad) |
| Mood Relaxed | `mood_relaxed-msd-musicnn-1.pb` | P(relaxed) |
| Mood Aggressive | `mood_aggressive-msd-musicnn-1.pb` | P(aggressive) |
| Mood Party | `mood_party-msd-musicnn-1.pb` | P(party) |
| Mood Acoustic | `mood_acoustic-msd-musicnn-1.pb` | P(acoustic) |
| Mood Electronic | `mood_electronic-msd-musicnn-1.pb` | P(electronic) |
| Danceability | `danceability-msd-musicnn-1.pb` | P(danceable) |
| Voice/Instrumental | `voice_instrumental-msd-musicnn-1.pb` | P(instrumental) |
### 3. Derived Features
Valence and Arousal are derived from the mood predictions:
```python
# Valence = emotional positivity
valence = happy * 0.5 + party * 0.3 + (1 - sad) * 0.2
# Arousal = energy level
arousal = aggressive * 0.35 + party * 0.25 + electronic * 0.2
+ (1 - relaxed) * 0.1 + (1 - acoustic) * 0.1
```
## Docker Configuration
### Dockerfile
```dockerfile
FROM ubuntu:20.04
# Install essentia-tensorflow (includes TensorFlow + MusiCNN support)
RUN pip3 install --no-cache-dir essentia-tensorflow
# Download MusiCNN models
RUN curl -L -o /app/models/msd-musicnn-1.pb \
"https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.pb"
# Classification heads
RUN curl -L -o /app/models/mood_happy-msd-musicnn-1.pb \
"https://essentia.upf.edu/models/classification-heads/mood_happy/mood_happy-msd-musicnn-1.pb"
# ... (other models)
```
### Requirements
- **Ubuntu 20.04** (for Python 3.8 compatibility)
- **essentia-tensorflow** pip package
- **~10MB** for all models combined
## Usage in Code
```python
from essentia.standard import TensorflowPredictMusiCNN, TensorflowPredict2D
# Load base embedding model
musicnn = TensorflowPredictMusiCNN(
graphFilename='/app/models/msd-musicnn-1.pb',
output="model/dense/BiasAdd" # Embedding output layer
)
# Load classification head
mood_happy = TensorflowPredict2D(
graphFilename='/app/models/mood_happy-msd-musicnn-1.pb',
output="model/Softmax"
)
# Process audio
audio = es.MonoLoader(filename=path, sampleRate=16000)()
embeddings = musicnn(audio) # Shape: [frames, 200]
predictions = mood_happy(embeddings) # Shape: [frames, 2]
happy_score = float(np.mean(predictions[:, 1])) # Average over frames
```
## Output Fields
Enhanced mode produces these additional fields:
| Field | Type | Range | Description |
|-------|------|-------|-------------|
| moodHappy | float | 0-1 | ML probability of happy mood |
| moodSad | float | 0-1 | ML probability of sad mood |
| moodRelaxed | float | 0-1 | ML probability of relaxed mood |
| moodAggressive | float | 0-1 | ML probability of aggressive mood |
| moodParty | float | 0-1 | ML probability of party mood |
| moodAcoustic | float | 0-1 | ML probability of acoustic sound |
| moodElectronic | float | 0-1 | ML probability of electronic sound |
| danceabilityMl | float | 0-1 | ML danceability score |
| valence | float | 0-1 | Derived emotional positivity |
| arousal | float | 0-1 | Derived energy level |
| acousticness | float | 0-1 | From moodAcoustic |
| instrumentalness | float | 0-1 | ML voice/instrumental detection |
## Comparison: Standard vs Enhanced
| Feature | Standard Mode | Enhanced Mode |
|---------|---------------|---------------|
| Mood Detection | Heuristic (key/BPM/energy) | ML (MusiCNN) |
| Accuracy | Approximate | Research-grade |
| Speed | Fast (~100ms) | Moderate (~500ms) |
| Dependencies | Essentia core | Essentia + TensorFlow |
| Model Size | 0 | ~10MB |
| Python Version | Any | 3.7-3.9 (for pip) |
## Fallback Behavior
If Enhanced mode fails to initialize (missing models, TensorFlow errors), the analyzer automatically falls back to Standard mode:
```python
if self.enhanced_mode and self.musicnn_model:
ml_features = self._extract_ml_features(audio_16k)
result.update(ml_features)
else:
self._apply_standard_estimates(result, scale, bpm)
```
## References
- [Essentia TensorFlow Documentation](https://essentia.upf.edu/machine_learning.html)
- [MusiCNN Paper](https://arxiv.org/abs/1711.02520)
- [Essentia Model Zoo](https://essentia.upf.edu/models/)