lidify/docs/implementation-summaries/audio-analysis-standard-mode/ENHANCED_MODE.md

# Audio Analysis - Enhanced Mode (MusiCNN)

## Overview

Enhanced mode uses Essentia's TensorFlow integration with MusiCNN (Music Convolutional Neural Network) models to perform ML-based mood and audio classification. Because the scores come from trained models rather than rules over key, BPM, and energy, mood detection is more accurate than in the heuristic-based Standard mode.

## Architecture

```
                    ┌─────────────────┐
                    │  Audio File     │
                    │   (16kHz mono)  │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │ TensorflowPredict│
                    │    MusiCNN      │
                    │  (Embeddings)   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
    ┌─────────▼─────┐ ┌──────▼─────┐ ┌──────▼─────┐
    │  Mood Happy   │ │  Mood Sad  │ │ Danceability│
    │ TensorFlow    │ │ TensorFlow │ │ TensorFlow  │
    │ Predict2D     │ │ Predict2D  │ │ Predict2D   │
    └───────┬───────┘ └─────┬──────┘ └──────┬──────┘
            │               │               │
            └───────────────┼───────────────┘
                            │
                    ┌───────▼───────┐
                    │ Derived Scores│
                    │ Valence/Arousal│
                    └───────────────┘
```

## Key Components

### 1. Base Model: MusiCNN

- **Model**: `msd-musicnn-1.pb` (~3MB)
- **Source**: [Essentia Model Zoo](https://essentia.upf.edu/models/autotagging/msd/)
- **Function**: Extracts 200-dimensional embeddings from audio
- **Algorithm**: `TensorflowPredictMusiCNN`

### 2. Classification Heads

Each classification head takes the MusiCNN embeddings and outputs probabilities:

| Model | File | Output |
|-------|------|--------|
| Mood Happy | `mood_happy-msd-musicnn-1.pb` | P(happy) |
| Mood Sad | `mood_sad-msd-musicnn-1.pb` | P(sad) |
| Mood Relaxed | `mood_relaxed-msd-musicnn-1.pb` | P(relaxed) |
| Mood Aggressive | `mood_aggressive-msd-musicnn-1.pb` | P(aggressive) |
| Mood Party | `mood_party-msd-musicnn-1.pb` | P(party) |
| Mood Acoustic | `mood_acoustic-msd-musicnn-1.pb` | P(acoustic) |
| Mood Electronic | `mood_electronic-msd-musicnn-1.pb` | P(electronic) |
| Danceability | `danceability-msd-musicnn-1.pb` | P(danceable) |
| Voice/Instrumental | `voice_instrumental-msd-musicnn-1.pb` | P(instrumental) |
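
The file names in the table share one pattern, `<head>-msd-musicnn-1.pb`. As a minimal sketch, the expected local paths can be generated rather than hard-coded. The names `HEADS`, `MODEL_DIR`, and `head_model_path` are hypothetical and not part of the analyzer; `/app/models` is the directory used in the Docker configuration in this document.

```python
from pathlib import PurePosixPath

# Head names as they appear in the file names listed in the table above.
HEADS = [
    "mood_happy", "mood_sad", "mood_relaxed", "mood_aggressive",
    "mood_party", "mood_acoustic", "mood_electronic",
    "danceability", "voice_instrumental",
]

# Assumption: models live in the directory used by the Dockerfile.
MODEL_DIR = PurePosixPath("/app/models")


def head_model_path(head, model_dir=MODEL_DIR):
    """Return the expected path of a classification head's .pb file."""
    if head not in HEADS:
        raise ValueError(f"unknown classification head: {head}")
    return str(model_dir / f"{head}-msd-musicnn-1.pb")


if __name__ == "__main__":
    for head in HEADS:
        print(head_model_path(head))
```

Keeping the head names in a single list means the download step, the loader, and any startup check can iterate over the same source.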

### 3. Derived Features

Valence and Arousal are derived from the mood predictions:

```python
# Valence = emotional positivity
valence = happy * 0.5 + party * 0.3 + (1 - sad) * 0.2

# Arousal = energy level (parentheses let the expression span two lines)
arousal = (
    aggressive * 0.35 + party * 0.25 + electronic * 0.2
    + (1 - relaxed) * 0.1 + (1 - acoustic) * 0.1
)
```
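
The weights in each formula sum to 1.0, so both derived scores stay within 0-1 whenever the inputs do. The following self-contained sketch wraps the two formulas in a function; `derive_valence_arousal` and the dictionary keys are hypothetical names chosen for illustration, not the analyzer's actual API.

```python
def derive_valence_arousal(moods):
    """Compute (valence, arousal) from mood probabilities in the range 0-1.

    `moods` is a dict with the keys: happy, sad, party, aggressive,
    electronic, relaxed, acoustic (hypothetical key names).
    """
    valence = (
        moods["happy"] * 0.5
        + moods["party"] * 0.3
        + (1 - moods["sad"]) * 0.2
    )
    arousal = (
        moods["aggressive"] * 0.35
        + moods["party"] * 0.25
        + moods["electronic"] * 0.2
        + (1 - moods["relaxed"]) * 0.1
        + (1 - moods["acoustic"]) * 0.1
    )
    return valence, arousal


if __name__ == "__main__":
    example = {
        "happy": 0.8, "sad": 0.1, "party": 0.6, "aggressive": 0.2,
        "electronic": 0.5, "relaxed": 0.3, "acoustic": 0.4,
    }
    print(derive_valence_arousal(example))
```

Note that a track scoring 0 on every mood still receives 0.2 for both valence and arousal, because the inverted terms (`1 - sad`, `1 - relaxed`, `1 - acoustic`) contribute their full weight.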

## Docker Configuration

### Dockerfile

```dockerfile
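FROM ubuntu:20.04

# Install Python, pip, and curl (not included in the base image)
RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
       python3 python3-pip curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*

# Install essentia-tensorflow (includes TensorFlow + MusiCNN support)
RUN pip3 install --no-cache-dir essentia-tensorflow

# Download the MusiCNN base model (create the target directory first)
RUN mkdir -p /app/models \
    && curl -L -o /app/models/msd-musicnn-1.pb \
    "https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.pb"

# Classification heads
RUN curl -L -o /app/models/mood_happy-msd-musicnn-1.pb \
    "https://essentia.upf.edu/models/classification-heads/mood_happy/mood_happy-msd-musicnn-1.pb"
# ... (other models)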
```

### Requirements

- **Ubuntu 20.04** (for Python 3.8 compatibility)
- **essentia-tensorflow** pip package
- **~10MB** for all models combined

## Usage in Code

```python
import numpy as np
from essentia.standard import (
    MonoLoader,
    TensorflowPredict2D,
    TensorflowPredictMusiCNN,
)

path = "/path/to/track.mp3"  # Placeholder: audio file to analyze

# Load base embedding model
musicnn = TensorflowPredictMusiCNN(
    graphFilename='/app/models/msd-musicnn-1.pb',
    output="model/dense/BiasAdd"  # Embedding output layer
)

# Load classification head
mood_happy = TensorflowPredict2D(
    graphFilename='/app/models/mood_happy-msd-musicnn-1.pb',
    output="model/Softmax"
)

# Process audio (MonoLoader downmixes to mono and resamples to 16 kHz)
audio = MonoLoader(filename=path, sampleRate=16000)()
embeddings = musicnn(audio)  # Shape: [frames, 200]
predictions = mood_happy(embeddings)  # Shape: [frames, 2]

# Average the positive-class column over frames. The column order is defined
# by the "classes" list in each head's metadata JSON, so verify the index per model.
happy_score = float(np.mean(predictions[:, 1]))
```
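
The frame-averaging step can be isolated and checked without loading any model. This is a minimal sketch that takes frame-wise predictions as plain rows; `mean_class_probability` and its `class_index` parameter are hypothetical names, and the right index for each head is an assumption to confirm against that head's metadata.

```python
from statistics import fmean


def mean_class_probability(predictions, class_index):
    """Average one class column over all frames.

    `predictions` is a sequence of per-frame rows such as [p_class0, p_class1].
    `class_index` selects the column of the class of interest.
    """
    rows = list(predictions)
    if not rows:
        raise ValueError("no frames to average")
    return fmean(float(row[class_index]) for row in rows)


if __name__ == "__main__":
    # Synthetic two-frame example, not real model output.
    frames = [[0.2, 0.8], [0.4, 0.6]]
    print(mean_class_probability(frames, class_index=1))
```

Making the column index an explicit argument avoids silently reading the negative class when a head lists its classes in a different order.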

## Output Fields

Enhanced mode produces these additional fields:

| Field | Type | Range | Description |
|-------|------|-------|-------------|
| moodHappy | float | 0-1 | ML probability of happy mood |
| moodSad | float | 0-1 | ML probability of sad mood |
| moodRelaxed | float | 0-1 | ML probability of relaxed mood |
| moodAggressive | float | 0-1 | ML probability of aggressive mood |
| moodParty | float | 0-1 | ML probability of party mood |
| moodAcoustic | float | 0-1 | ML probability of acoustic sound |
| moodElectronic | float | 0-1 | ML probability of electronic sound |
| danceabilityMl | float | 0-1 | ML danceability score |
| valence | float | 0-1 | Derived emotional positivity |
| arousal | float | 0-1 | Derived energy level |
| acousticness | float | 0-1 | From moodAcoustic |
| instrumentalness | float | 0-1 | ML voice/instrumental detection |
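
Since every field in the table is a float in the range 0-1, a result can be sanity-checked before it is stored. The sketch below is one possible check; `ENHANCED_FIELDS` and `validate_enhanced_fields` are hypothetical names, and the analyzer itself may not perform this validation.

```python
# Field names copied from the table above.
ENHANCED_FIELDS = (
    "moodHappy", "moodSad", "moodRelaxed", "moodAggressive",
    "moodParty", "moodAcoustic", "moodElectronic", "danceabilityMl",
    "valence", "arousal", "acousticness", "instrumentalness",
)


def validate_enhanced_fields(result):
    """Return a list of problems; an empty list means the result looks valid."""
    problems = []
    for name in ENHANCED_FIELDS:
        if name not in result:
            problems.append(f"{name}: missing")
            continue
        value = result[name]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{name}: not a number")
        elif not 0.0 <= value <= 1.0:
            problems.append(f"{name}: {value} outside 0-1")
    return problems


if __name__ == "__main__":
    sample = {name: 0.5 for name in ENHANCED_FIELDS}
    print(validate_enhanced_fields(sample))
```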

## Comparison: Standard vs Enhanced

| Feature | Standard Mode | Enhanced Mode |
|---------|---------------|---------------|
| Mood Detection | Heuristic (key/BPM/energy) | ML (MusiCNN) |
| Accuracy | Approximate | Research-grade |
| Speed | Fast (~100ms) | Moderate (~500ms) |
| Dependencies | Essentia core | Essentia + TensorFlow |
| Model Size | 0 | ~10MB |
| Python Version | Any | 3.7-3.9 (for pip) |

## Fallback Behavior

If Enhanced mode fails to initialize (missing models, TensorFlow errors), the analyzer automatically falls back to Standard mode:

```python
if self.enhanced_mode and self.musicnn_model:
    ml_features = self._extract_ml_features(audio_16k)
    result.update(ml_features)
else:
    self._apply_standard_estimates(result, scale, bpm)
```
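
One way to decide the mode at startup is to check that every required model file is present before attempting to load TensorFlow graphs. The sketch below covers only the missing-file case; `missing_models` and `select_mode` are hypothetical helpers, and the analyzer's real initialization logic is not shown in this document.

```python
from pathlib import Path


def missing_models(model_dir, filenames):
    """Return the model file names that are absent from `model_dir`."""
    base = Path(model_dir)
    return [name for name in filenames if not (base / name).is_file()]


def select_mode(model_dir, filenames):
    """Return "enhanced" if all model files exist, otherwise "standard"."""
    return "standard" if missing_models(model_dir, filenames) else "enhanced"


if __name__ == "__main__":
    required = ["msd-musicnn-1.pb", "mood_happy-msd-musicnn-1.pb"]
    print(select_mode("/app/models", required))
    print(missing_models("/app/models", required))
```

TensorFlow errors raised while loading a graph would still need their own handling (for example a `try`/`except` around model construction), since a file can exist and still fail to load.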

## References

- [Essentia TensorFlow Documentation](https://essentia.upf.edu/machine_learning.html)
- [MusiCNN Paper](https://arxiv.org/abs/1711.02520)
- [Essentia Model Zoo](https://essentia.upf.edu/models/)