Audio Analysis - Enhanced Mode (MusiCNN)
Overview
Enhanced mode uses Essentia's TensorFlow integration with MusiCNN (a musically motivated convolutional neural network) models to perform ML-based mood and audio classification. It gives more accurate mood detection than the heuristic-based Standard mode.
Architecture
                   ┌─────────────────┐
                   │   Audio File    │
                   │  (16kHz mono)   │
                   └────────┬────────┘
                            │
                   ┌────────▼────────┐
                   │TensorflowPredict│
                   │     MusiCNN     │
                   │  (Embeddings)   │
                   └────────┬────────┘
                            │
            ┌───────────────┼───────────────┐
            │               │               │
    ┌───────▼───────┐ ┌─────▼──────┐ ┌──────▼──────┐
    │  Mood Happy   │ │  Mood Sad  │ │ Danceability│
    │  TensorFlow   │ │ TensorFlow │ │  TensorFlow │
    │   Predict2D   │ │ Predict2D  │ │  Predict2D  │
    └───────┬───────┘ └─────┬──────┘ └──────┬──────┘
            │               │               │
            └───────────────┼───────────────┘
                            │
                    ┌───────▼───────┐
                    │ Derived Scores│
                    │Valence/Arousal│
                    └───────────────┘
Key Components
1. Base Model: MusiCNN
- Model: msd-musicnn-1.pb (~3MB)
- Source: Essentia Model Zoo
- Function: Extracts 200-dimensional embeddings from audio
- Algorithm: TensorflowPredictMusiCNN
2. Classification Heads
Each classification head takes the MusiCNN embeddings and outputs probabilities:
| Model | File | Output |
|---|---|---|
| Mood Happy | mood_happy-msd-musicnn-1.pb | P(happy) |
| Mood Sad | mood_sad-msd-musicnn-1.pb | P(sad) |
| Mood Relaxed | mood_relaxed-msd-musicnn-1.pb | P(relaxed) |
| Mood Aggressive | mood_aggressive-msd-musicnn-1.pb | P(aggressive) |
| Mood Party | mood_party-msd-musicnn-1.pb | P(party) |
| Mood Acoustic | mood_acoustic-msd-musicnn-1.pb | P(acoustic) |
| Mood Electronic | mood_electronic-msd-musicnn-1.pb | P(electronic) |
| Danceability | danceability-msd-musicnn-1.pb | P(danceable) |
| Voice/Instrumental | voice_instrumental-msd-musicnn-1.pb | P(instrumental) |
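The filenames in the table follow one pattern, `<head>-msd-musicnn-1.pb`. The following is a minimal sketch, assuming all models sit in a single directory, of how the expected files might be enumerated and checked before Enhanced mode is enabled. `BASE_MODEL`, `HEAD_NAMES`, and `missing_models` are illustrative names, not part of the analyzer.

```python
from pathlib import Path

# Base embedding model plus the nine classification heads from the table.
BASE_MODEL = "msd-musicnn-1.pb"
HEAD_NAMES = [
    "mood_happy", "mood_sad", "mood_relaxed", "mood_aggressive",
    "mood_party", "mood_acoustic", "mood_electronic",
    "danceability", "voice_instrumental",
]


def head_filename(head_name):
    """Build the model filename for one classification head."""
    return f"{head_name}-msd-musicnn-1.pb"


def missing_models(model_dir):
    """Return the expected model filenames that are absent from model_dir."""
    directory = Path(model_dir)
    expected = [BASE_MODEL] + [head_filename(name) for name in HEAD_NAMES]
    return [name for name in expected if not (directory / name).is_file()]
```

An empty list from `missing_models("/app/models")` would mean every file the table lists is present.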
3. Derived Features
Valence and Arousal are derived from the mood predictions:
# Valence = emotional positivity
valence = happy * 0.5 + party * 0.3 + (1 - sad) * 0.2
# Arousal = energy level
arousal = (aggressive * 0.35 + party * 0.25 + electronic * 0.2
           + (1 - relaxed) * 0.1 + (1 - acoustic) * 0.1)
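Because the weights in each formula sum to 1.0, both scores stay in the 0-1 range whenever the inputs do. A small sketch of the same formulas as one function follows; the function name and the dictionary keys are assumptions made for illustration.

```python
def derive_valence_arousal(moods):
    """Combine mood probabilities (each in [0, 1]) into (valence, arousal)."""
    # Valence = emotional positivity
    valence = (moods["happy"] * 0.5
               + moods["party"] * 0.3
               + (1 - moods["sad"]) * 0.2)
    # Arousal = energy level
    arousal = (moods["aggressive"] * 0.35
               + moods["party"] * 0.25
               + moods["electronic"] * 0.2
               + (1 - moods["relaxed"]) * 0.1
               + (1 - moods["acoustic"]) * 0.1)
    return valence, arousal


# Made-up probabilities, used only to show the call.
example = {
    "happy": 0.5, "sad": 0.5, "party": 0.5, "aggressive": 0.5,
    "electronic": 0.5, "relaxed": 0.5, "acoustic": 0.5,
}
valence, arousal = derive_valence_arousal(example)
```

With every input at 0.5, both scores come out at 0.5, which is a quick sanity check on the weights.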
Docker Configuration
Dockerfile
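FROM ubuntu:20.04
# Install Python, pip, and curl (not included in the base image)
RUN apt-get update \
    && DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
       python3 python3-pip curl ca-certificates \
    && rm -rf /var/lib/apt/lists/*
# Install essentia-tensorflow (includes TensorFlow + MusiCNN support)
RUN pip3 install --no-cache-dir essentia-tensorflow
# Download MusiCNN models
RUN mkdir -p /app/models && curl -L -o /app/models/msd-musicnn-1.pb \
    "https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.pb"
# Classification heads
RUN curl -L -o /app/models/mood_happy-msd-musicnn-1.pb \
    "https://essentia.upf.edu/models/classification-heads/mood_happy/mood_happy-msd-musicnn-1.pb"
# ... (other models)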
Requirements
- Ubuntu 20.04 (for Python 3.8 compatibility)
- essentia-tensorflow pip package
- ~10MB for all models combined
Usage in Code
import numpy as np
from essentia.standard import MonoLoader, TensorflowPredictMusiCNN, TensorflowPredict2D
# Load base embedding model
musicnn = TensorflowPredictMusiCNN(
    graphFilename='/app/models/msd-musicnn-1.pb',
    output="model/dense/BiasAdd"  # Embedding output layer
)
# Load classification head
mood_happy = TensorflowPredict2D(
    graphFilename='/app/models/mood_happy-msd-musicnn-1.pb',
    output="model/Softmax"
)
# Process audio (path is the audio file to analyze)
audio = MonoLoader(filename=path, sampleRate=16000)()
embeddings = musicnn(audio)  # Shape: [frames, 200]
predictions = mood_happy(embeddings)  # Shape: [frames, 2]
# Column 1 is used as P(happy) here; confirm the class order in the model's JSON metadata
happy_score = float(np.mean(predictions[:, 1]))  # Average over frames
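The last step reduces a `[frames, 2]` prediction matrix to a single score by averaging one class column over all frames. The sketch below shows that step on its own in plain Python, so it runs without Essentia. `mean_class_probability` is a hypothetical helper, and the `classes` list is illustrative only; the real order should be read from the model's metadata.

```python
from statistics import fmean


def mean_class_probability(predictions, class_index):
    """Average the probability of one class over all frames.

    predictions is any sequence of per-frame rows (a list of lists or a
    2D NumPy array both work); class_index selects the column.
    """
    column = [float(row[class_index]) for row in predictions]
    if not column:
        raise ValueError("no frames to average")
    return fmean(column)


# Illustrative class order; read the real one from the model metadata.
classes = ["non_happy", "happy"]
frames = [[0.2, 0.8], [0.4, 0.6]]  # two made-up frames
happy_score = mean_class_probability(frames, classes.index("happy"))
```

Looking the column up by label rather than hard-coding an index avoids silently reading the complementary class if a head orders its outputs differently.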
Output Fields
Enhanced mode produces these additional fields:
| Field | Type | Range | Description |
|---|---|---|---|
| moodHappy | float | 0-1 | ML probability of happy mood |
| moodSad | float | 0-1 | ML probability of sad mood |
| moodRelaxed | float | 0-1 | ML probability of relaxed mood |
| moodAggressive | float | 0-1 | ML probability of aggressive mood |
| moodParty | float | 0-1 | ML probability of party mood |
| moodAcoustic | float | 0-1 | ML probability of acoustic sound |
| moodElectronic | float | 0-1 | ML probability of electronic sound |
| danceabilityMl | float | 0-1 | ML danceability score |
| valence | float | 0-1 | Derived emotional positivity |
| arousal | float | 0-1 | Derived energy level |
| acousticness | float | 0-1 | From moodAcoustic |
| instrumentalness | float | 0-1 | ML voice/instrumental detection |
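Since every Enhanced-mode field is a float in the 0-1 range, a result can be sanity-checked mechanically. This is a minimal sketch, assuming the result is a plain dictionary keyed by the field names in the table; `ENHANCED_FIELDS` and `invalid_fields` are illustrative names.

```python
ENHANCED_FIELDS = (
    "moodHappy", "moodSad", "moodRelaxed", "moodAggressive",
    "moodParty", "moodAcoustic", "moodElectronic", "danceabilityMl",
    "valence", "arousal", "acousticness", "instrumentalness",
)


def invalid_fields(result):
    """Return the Enhanced-mode fields that are missing, not floats,
    or outside the 0-1 range."""
    bad = []
    for name in ENHANCED_FIELDS:
        value = result.get(name)
        if not isinstance(value, float) or not 0.0 <= value <= 1.0:
            bad.append(name)
    return bad
```

An empty list means all twelve fields are present and in range; NaN values are reported as invalid because they fail the range comparison.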
Comparison: Standard vs Enhanced
| Feature | Standard Mode | Enhanced Mode |
|---|---|---|
| Mood Detection | Heuristic (key/BPM/energy) | ML (MusiCNN) |
| Accuracy | Approximate | Research-grade |
| Speed | Fast (~100ms) | Moderate (~500ms) |
| Dependencies | Essentia core | Essentia + TensorFlow |
| Model Size | 0 | ~10MB |
| Python Version | Any | 3.7-3.9 (for pip) |
Fallback Behavior
If Enhanced mode fails to initialize (missing models, TensorFlow errors), the analyzer automatically falls back to Standard mode:
if self.enhanced_mode and self.musicnn_model:
    ml_features = self._extract_ml_features(audio_16k)
    result.update(ml_features)
else:
    self._apply_standard_estimates(result, scale, bpm)
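The check above relies on `enhanced_mode` and `musicnn_model` having been set during initialization. One way that initialization might look is sketched here, under the assumption that model loading is wrapped in a single callable that raises on failure. The `Analyzer` class, the `load_models` parameter, and the `mode` method are hypothetical and shown only to illustrate the fallback.

```python
import logging

logger = logging.getLogger(__name__)


class Analyzer:
    """Hypothetical analyzer showing only the initialization fallback."""

    def __init__(self, load_models, enhanced=True):
        # load_models is an assumed callable that returns the loaded
        # MusiCNN model, or raises if model files or TensorFlow are missing.
        self.enhanced_mode = enhanced
        self.musicnn_model = None
        if enhanced:
            try:
                self.musicnn_model = load_models()
            except Exception as exc:  # broad on purpose: any failure falls back
                logger.warning("Enhanced mode unavailable, using Standard mode: %s", exc)
                self.enhanced_mode = False

    def mode(self):
        """Report which analysis path would be taken."""
        if self.enhanced_mode and self.musicnn_model:
            return "enhanced"
        return "standard"
```

Catching a broad exception here is a deliberate trade-off in the sketch: a missing file, an import error, and a TensorFlow graph error all lead to the same Standard-mode path, at the cost of hiding the specific cause unless the log is inspected.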