lidify/docs/implementation-summaries/audio-analysis-standard-mode/ENHANCED_MODE.md

Audio Analysis - Enhanced Mode (MusiCNN)

Overview

Enhanced mode uses Essentia's TensorFlow integration with MusiCNN (Music Convolutional Neural Network) models to perform ML-based mood and audio classification. This provides significantly more accurate mood detection compared to the heuristic-based Standard mode.

Architecture

                    ┌─────────────────┐
                    │  Audio File     │
                    │   (16kHz mono)  │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │TensorflowPredict│
                    │    MusiCNN      │
                    │  (Embeddings)   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
    ┌─────────▼─────┐ ┌──────▼─────┐ ┌──────▼──────┐
    │  Mood Happy   │ │  Mood Sad  │ │ Danceability│
    │ TensorFlow    │ │ TensorFlow │ │ TensorFlow  │
    │ Predict2D     │ │ Predict2D  │ │ Predict2D   │
    └───────┬───────┘ └─────┬──────┘ └──────┬──────┘
            │               │               │
            └───────────────┼───────────────┘
                            │
                    ┌───────▼────────┐
                    │ Derived Scores │
                    │ Valence/Arousal│
                    └────────────────┘

Key Components

1. Base Model: MusiCNN

  • Model: msd-musicnn-1.pb (~3MB)
  • Source: Essentia Model Zoo
  • Function: Extracts 200-dimensional embeddings from audio
  • Algorithm: TensorflowPredictMusiCNN

2. Classification Heads

Each classification head takes the MusiCNN embeddings and outputs probabilities:

| Model | File | Output |
|---|---|---|
| Mood Happy | mood_happy-msd-musicnn-1.pb | P(happy) |
| Mood Sad | mood_sad-msd-musicnn-1.pb | P(sad) |
| Mood Relaxed | mood_relaxed-msd-musicnn-1.pb | P(relaxed) |
| Mood Aggressive | mood_aggressive-msd-musicnn-1.pb | P(aggressive) |
| Mood Party | mood_party-msd-musicnn-1.pb | P(party) |
| Mood Acoustic | mood_acoustic-msd-musicnn-1.pb | P(acoustic) |
| Mood Electronic | mood_electronic-msd-musicnn-1.pb | P(electronic) |
| Danceability | danceability-msd-musicnn-1.pb | P(danceable) |
| Voice/Instrumental | voice_instrumental-msd-musicnn-1.pb | P(instrumental) |

3. Derived Features

Valence and Arousal are derived from the mood predictions:

# Valence = emotional positivity
valence = happy * 0.5 + party * 0.3 + (1 - sad) * 0.2

# Arousal = energy level
arousal = aggressive * 0.35 + party * 0.25 + electronic * 0.2 
        + (1 - relaxed) * 0.1 + (1 - acoustic) * 0.1
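These weighted sums can be packaged as a small helper; a sketch (the function name `derive_scores` is ours, not from the codebase — and because each weight set sums to 1.0 over terms bounded by [0, 1], both outputs stay in [0, 1] without clamping):

```python
def derive_scores(moods: dict) -> tuple:
    """Combine per-mood probabilities (0-1) into valence and arousal."""
    # Valence = emotional positivity
    valence = (moods["happy"] * 0.5
               + moods["party"] * 0.3
               + (1 - moods["sad"]) * 0.2)
    # Arousal = energy level
    arousal = (moods["aggressive"] * 0.35
               + moods["party"] * 0.25
               + moods["electronic"] * 0.2
               + (1 - moods["relaxed"]) * 0.1
               + (1 - moods["acoustic"]) * 0.1)
    return valence, arousal
```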

Docker Configuration

Dockerfile

FROM ubuntu:20.04

# The base image ships without Python tooling; install pip and curl first
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3-pip curl ca-certificates && \
    rm -rf /var/lib/apt/lists/*

# Install essentia-tensorflow (includes TensorFlow + MusiCNN support)
RUN pip3 install --no-cache-dir essentia-tensorflow

# Download the MusiCNN base model
RUN mkdir -p /app/models && \
    curl -L -o /app/models/msd-musicnn-1.pb \
    "https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.pb"

# Classification heads
RUN curl -L -o /app/models/mood_happy-msd-musicnn-1.pb \
    "https://essentia.upf.edu/models/classification-heads/mood_happy/mood_happy-msd-musicnn-1.pb"
# ... (other models)
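With nine heads to fetch, a loop keeps the Dockerfile short; a sketch of an equivalent fragment, assuming the model-zoo URL pattern shown above holds for every head:

```dockerfile
# Fetch all nine classification heads in a single layer
RUN for m in mood_happy mood_sad mood_relaxed mood_aggressive \
             mood_party mood_acoustic mood_electronic \
             danceability voice_instrumental; do \
      curl -fL -o "/app/models/${m}-msd-musicnn-1.pb" \
        "https://essentia.upf.edu/models/classification-heads/${m}/${m}-msd-musicnn-1.pb" \
      || exit 1; \
    done
```

The `-f` flag makes curl fail (and the `exit 1` abort the build) on an HTTP error instead of silently saving an error page as a model file.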

Requirements

  • Ubuntu 20.04 (for Python 3.8 compatibility)
  • essentia-tensorflow pip package
  • ~10MB for all models combined

Usage in Code

import numpy as np
import essentia.standard as es
from essentia.standard import TensorflowPredictMusiCNN, TensorflowPredict2D

# Load base embedding model
musicnn = TensorflowPredictMusiCNN(
    graphFilename='/app/models/msd-musicnn-1.pb',
    output="model/dense/BiasAdd"  # Embedding output layer
)

# Load classification head
mood_happy = TensorflowPredict2D(
    graphFilename='/app/models/mood_happy-msd-musicnn-1.pb',
    output="model/Softmax"
)

# Process audio
audio = es.MonoLoader(filename=path, sampleRate=16000)()
embeddings = musicnn(audio)  # Shape: [frames, 200]
predictions = mood_happy(embeddings)  # Shape: [frames, 2]
happy_score = float(np.mean(predictions[:, 1]))  # Average over frames
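The same pattern extends to all nine heads; a sketch of the aggregation step, with the head objects generalized to plain callables (in production they would be the TensorflowPredict2D instances shown above):

```python
import numpy as np

def classify_all(embeddings, heads):
    """Run each classification head over shared MusiCNN embeddings.

    `heads` maps an output field name to a callable returning
    [frames, 2] softmax rows; column 1 is the positive class,
    averaged over frames as in the snippet above.
    """
    return {name: float(np.mean(head(embeddings)[:, 1]))
            for name, head in heads.items()}
```

Computing the embeddings once and reusing them for every head is the main saving: the base MusiCNN pass dominates the runtime, while each 2D head is cheap.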

Output Fields

Enhanced mode produces these additional fields:

| Field | Type | Range | Description |
|---|---|---|---|
| moodHappy | float | 0-1 | ML probability of happy mood |
| moodSad | float | 0-1 | ML probability of sad mood |
| moodRelaxed | float | 0-1 | ML probability of relaxed mood |
| moodAggressive | float | 0-1 | ML probability of aggressive mood |
| moodParty | float | 0-1 | ML probability of party mood |
| moodAcoustic | float | 0-1 | ML probability of acoustic sound |
| moodElectronic | float | 0-1 | ML probability of electronic sound |
| danceabilityMl | float | 0-1 | ML danceability score |
| valence | float | 0-1 | Derived emotional positivity |
| arousal | float | 0-1 | Derived energy level |
| acousticness | float | 0-1 | From moodAcoustic |
| instrumentalness | float | 0-1 | ML voice/instrumental detection |
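For reference, an Enhanced-mode result might look like this (values invented for illustration; valence and arousal roughly follow the derivation formulas above):

```python
# Illustrative Enhanced-mode output; every field is a 0-1 float.
result = {
    "moodHappy": 0.81, "moodSad": 0.07, "moodRelaxed": 0.22,
    "moodAggressive": 0.05, "moodParty": 0.64, "moodAcoustic": 0.12,
    "moodElectronic": 0.71, "danceabilityMl": 0.77,
    "valence": 0.78, "arousal": 0.49,
    "acousticness": 0.12, "instrumentalness": 0.09,
}
```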

Comparison: Standard vs Enhanced

| Feature | Standard Mode | Enhanced Mode |
|---|---|---|
| Mood Detection | Heuristic (key/BPM/energy) | ML (MusiCNN) |
| Accuracy | Approximate | Research-grade |
| Speed | Fast (~100ms) | Moderate (~500ms) |
| Dependencies | Essentia core | Essentia + TensorFlow |
| Model Size | 0 | ~10MB |
| Python Version | Any | 3.7-3.9 (for pip) |

Fallback Behavior

If Enhanced mode fails to initialize (missing models, TensorFlow errors), the analyzer automatically falls back to Standard mode:

if self.enhanced_mode and self.musicnn_model:
    ml_features = self._extract_ml_features(audio_16k)
    result.update(ml_features)
else:
    self._apply_standard_estimates(result, scale, bpm)
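A sketch of the initialization guard behind this fallback; the attribute names match the snippet above, but the constructor signature and `_load_models` helper are illustrative assumptions:

```python
import os

class Analyzer:
    def __init__(self, model_dir="/app/models"):
        self.enhanced_mode = False
        self.musicnn_model = None
        try:
            self.musicnn_model = self._load_models(model_dir)
            self.enhanced_mode = True
        except Exception as exc:  # missing models, TensorFlow errors
            print(f"Enhanced mode unavailable, falling back to Standard: {exc}")

    def _load_models(self, model_dir):
        # The real analyzer would construct TensorflowPredictMusiCNN and
        # the TensorflowPredict2D heads here; this stub only verifies
        # that the base model file is present.
        path = os.path.join(model_dir, "msd-musicnn-1.pb")
        if not os.path.exists(path):
            raise FileNotFoundError(path)
        return path
```

Failing closed into Standard mode at construction time means the per-track analysis path never has to re-check model availability.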

References