175 lines
6.5 KiB
Markdown
175 lines
6.5 KiB
Markdown
# Audio Analysis - Enhanced Mode (MusiCNN)
|
|
|
|
## Overview
|
|
|
|
Enhanced mode uses Essentia's TensorFlow integration with MusiCNN (Music Convolutional Neural Network) models to perform ML-based mood and audio classification. This provides significantly more accurate mood detection compared to the heuristic-based Standard mode.
|
|
|
|
## Architecture
|
|
|
|
```
|
|
┌─────────────────┐
|
|
│ Audio File │
|
|
│ (16kHz mono) │
|
|
└────────┬────────┘
|
|
│
|
|
┌────────▼────────┐
|
|
│ TensorflowPredict│
|
|
│ MusiCNN │
|
|
│ (Embeddings) │
|
|
└────────┬────────┘
|
|
│
|
|
┌──────────────┼──────────────┐
|
|
│ │ │
|
|
┌─────────▼─────┐ ┌──────▼─────┐ ┌──────▼─────┐
|
|
│ Mood Happy │ │ Mood Sad │ │ Danceability│
|
|
│ TensorFlow │ │ TensorFlow │ │ TensorFlow │
|
|
│ Predict2D │ │ Predict2D │ │ Predict2D │
|
|
└───────┬───────┘ └─────┬──────┘ └──────┬──────┘
|
|
│ │ │
|
|
└───────────────┼───────────────┘
|
|
│
|
|
┌───────▼───────┐
|
|
│ Derived Scores│
|
|
│ Valence/Arousal│
|
|
└───────────────┘
|
|
```
|
|
|
|
## Key Components
|
|
|
|
### 1. Base Model: MusiCNN
|
|
|
|
- **Model**: `msd-musicnn-1.pb` (~3MB)
|
|
- **Source**: [Essentia Model Zoo](https://essentia.upf.edu/models/autotagging/msd/)
|
|
- **Function**: Extracts 200-dimensional embeddings from audio
|
|
- **Algorithm**: `TensorflowPredictMusiCNN`
|
|
|
|
### 2. Classification Heads
|
|
|
|
Each classification head takes the MusiCNN embeddings and outputs probabilities:
|
|
|
|
| Model | File | Output |
|
|
|-------|------|--------|
|
|
| Mood Happy | `mood_happy-msd-musicnn-1.pb` | P(happy) |
|
|
| Mood Sad | `mood_sad-msd-musicnn-1.pb` | P(sad) |
|
|
| Mood Relaxed | `mood_relaxed-msd-musicnn-1.pb` | P(relaxed) |
|
|
| Mood Aggressive | `mood_aggressive-msd-musicnn-1.pb` | P(aggressive) |
|
|
| Mood Party | `mood_party-msd-musicnn-1.pb` | P(party) |
|
|
| Mood Acoustic | `mood_acoustic-msd-musicnn-1.pb` | P(acoustic) |
|
|
| Mood Electronic | `mood_electronic-msd-musicnn-1.pb` | P(electronic) |
|
|
| Danceability | `danceability-msd-musicnn-1.pb` | P(danceable) |
|
|
| Voice/Instrumental | `voice_instrumental-msd-musicnn-1.pb` | P(instrumental) |
|
|
|
|
### 3. Derived Features
|
|
|
|
Valence and Arousal are derived from the mood predictions:
|
|
|
|
```python
|
|
# Valence = emotional positivity
|
|
valence = happy * 0.5 + party * 0.3 + (1 - sad) * 0.2
|
|
|
|
# Arousal = energy level
|
|
arousal = aggressive * 0.35 + party * 0.25 + electronic * 0.2
|
|
+ (1 - relaxed) * 0.1 + (1 - acoustic) * 0.1
|
|
```
|
|
|
|
## Docker Configuration
|
|
|
|
### Dockerfile
|
|
|
|
```dockerfile
|
|
FROM ubuntu:20.04
|
|
|
|
# Install essentia-tensorflow (includes TensorFlow + MusiCNN support)
|
|
RUN pip3 install --no-cache-dir essentia-tensorflow
|
|
|
|
# Download MusiCNN models
|
|
RUN curl -L -o /app/models/msd-musicnn-1.pb \
|
|
"https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.pb"
|
|
|
|
# Classification heads
|
|
RUN curl -L -o /app/models/mood_happy-msd-musicnn-1.pb \
|
|
"https://essentia.upf.edu/models/classification-heads/mood_happy/mood_happy-msd-musicnn-1.pb"
|
|
# ... (other models)
|
|
```
|
|
|
|
### Requirements
|
|
|
|
- **Ubuntu 20.04** (for Python 3.8 compatibility)
|
|
- **essentia-tensorflow** pip package
|
|
- **~10MB** for all models combined
|
|
|
|
## Usage in Code
|
|
|
|
```python
|
|
from essentia.standard import TensorflowPredictMusiCNN, TensorflowPredict2D
|
|
|
|
# Load base embedding model
|
|
musicnn = TensorflowPredictMusiCNN(
|
|
graphFilename='/app/models/msd-musicnn-1.pb',
|
|
output="model/dense/BiasAdd" # Embedding output layer
|
|
)
|
|
|
|
# Load classification head
|
|
mood_happy = TensorflowPredict2D(
|
|
graphFilename='/app/models/mood_happy-msd-musicnn-1.pb',
|
|
output="model/Softmax"
|
|
)
|
|
|
|
# Process audio
|
|
audio = es.MonoLoader(filename=path, sampleRate=16000)()
|
|
embeddings = musicnn(audio) # Shape: [frames, 200]
|
|
predictions = mood_happy(embeddings) # Shape: [frames, 2]
|
|
happy_score = float(np.mean(predictions[:, 1])) # Average over frames
|
|
```
|
|
|
|
## Output Fields
|
|
|
|
Enhanced mode produces these additional fields:
|
|
|
|
| Field | Type | Range | Description |
|
|
|-------|------|-------|-------------|
|
|
| moodHappy | float | 0-1 | ML probability of happy mood |
|
|
| moodSad | float | 0-1 | ML probability of sad mood |
|
|
| moodRelaxed | float | 0-1 | ML probability of relaxed mood |
|
|
| moodAggressive | float | 0-1 | ML probability of aggressive mood |
|
|
| moodParty | float | 0-1 | ML probability of party mood |
|
|
| moodAcoustic | float | 0-1 | ML probability of acoustic sound |
|
|
| moodElectronic | float | 0-1 | ML probability of electronic sound |
|
|
| danceabilityMl | float | 0-1 | ML danceability score |
|
|
| valence | float | 0-1 | Derived emotional positivity |
|
|
| arousal | float | 0-1 | Derived energy level |
|
|
| acousticness | float | 0-1 | From moodAcoustic |
|
|
| instrumentalness | float | 0-1 | ML voice/instrumental detection |
|
|
|
|
## Comparison: Standard vs Enhanced
|
|
|
|
| Feature | Standard Mode | Enhanced Mode |
|
|
|---------|---------------|---------------|
|
|
| Mood Detection | Heuristic (key/BPM/energy) | ML (MusiCNN) |
|
|
| Accuracy | Approximate | Research-grade |
|
|
| Speed | Fast (~100ms) | Moderate (~500ms) |
|
|
| Dependencies | Essentia core | Essentia + TensorFlow |
|
|
| Model Size | 0 | ~10MB |
|
|
| Python Version | Any | 3.7-3.9 (for pip) |
|
|
|
|
## Fallback Behavior
|
|
|
|
If Enhanced mode fails to initialize (missing models, TensorFlow errors), the analyzer automatically falls back to Standard mode:
|
|
|
|
```python
|
|
if self.enhanced_mode and self.musicnn_model:
|
|
ml_features = self._extract_ml_features(audio_16k)
|
|
result.update(ml_features)
|
|
else:
|
|
self._apply_standard_estimates(result, scale, bpm)
|
|
```
|
|
|
|
## References
|
|
|
|
- [Essentia TensorFlow Documentation](https://essentia.upf.edu/machine_learning.html)
|
|
- [MusiCNN Paper](https://arxiv.org/abs/1711.02520)
|
|
- [Essentia Model Zoo](https://essentia.upf.edu/models/)
|
|
|
|
|
|
|