lidify/docs/implementation-summaries/audio-analysis-standard-mode/ENHANCED_MODE.md

Audio Analysis - Enhanced Mode (MusiCNN)

Overview

Enhanced mode uses Essentia's TensorFlow integration with MusiCNN (Music Convolutional Neural Network) models to perform ML-based mood and audio classification. This provides significantly more accurate mood detection compared to the heuristic-based Standard mode.

Architecture

                    ┌─────────────────┐
                    │  Audio File     │
                    │   (16kHz mono)  │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │TensorflowPredict│
                    │    MusiCNN      │
                    │  (Embeddings)   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
    ┌─────────▼─────┐ ┌──────▼─────┐ ┌──────▼──────┐
    │  Mood Happy   │ │  Mood Sad  │ │ Danceability│
    │ TensorFlow    │ │ TensorFlow │ │ TensorFlow  │
    │ Predict2D     │ │ Predict2D  │ │ Predict2D   │
    └───────┬───────┘ └─────┬──────┘ └──────┬──────┘
            │               │               │
            └───────────────┼───────────────┘
                            │
                    ┌───────▼────────┐
                    │ Derived Scores │
                    │ Valence/Arousal│
                    └────────────────┘

Key Components

1. Base Model: MusiCNN

  • Model: msd-musicnn-1.pb (~3MB)
  • Source: Essentia Model Zoo
  • Function: Extracts 200-dimensional embeddings from audio
  • Algorithm: TensorflowPredictMusiCNN

2. Classification Heads

Each classification head takes the MusiCNN embeddings and outputs probabilities:

| Model | File | Output |
|---|---|---|
| Mood Happy | mood_happy-msd-musicnn-1.pb | P(happy) |
| Mood Sad | mood_sad-msd-musicnn-1.pb | P(sad) |
| Mood Relaxed | mood_relaxed-msd-musicnn-1.pb | P(relaxed) |
| Mood Aggressive | mood_aggressive-msd-musicnn-1.pb | P(aggressive) |
| Mood Party | mood_party-msd-musicnn-1.pb | P(party) |
| Mood Acoustic | mood_acoustic-msd-musicnn-1.pb | P(acoustic) |
| Mood Electronic | mood_electronic-msd-musicnn-1.pb | P(electronic) |
| Danceability | danceability-msd-musicnn-1.pb | P(danceable) |
| Voice/Instrumental | voice_instrumental-msd-musicnn-1.pb | P(instrumental) |

3. Derived Features

Valence and Arousal are derived from the mood predictions:

# Valence = emotional positivity
valence = happy * 0.5 + party * 0.3 + (1 - sad) * 0.2

# Arousal = energy level
arousal = aggressive * 0.35 + party * 0.25 + electronic * 0.2 
        + (1 - relaxed) * 0.1 + (1 - acoustic) * 0.1
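These weighted sums can be packaged as a small helper; a sketch (the function name `derive_scores` is ours, not from the codebase — and because each weight set sums to 1.0 over terms bounded by [0, 1], both outputs stay in [0, 1] without clamping):

```python
def derive_scores(moods: dict) -> tuple:
    """Combine per-mood probabilities (0-1) into valence and arousal."""
    # Valence = emotional positivity
    valence = (moods["happy"] * 0.5
               + moods["party"] * 0.3
               + (1 - moods["sad"]) * 0.2)
    # Arousal = energy level
    arousal = (moods["aggressive"] * 0.35
               + moods["party"] * 0.25
               + moods["electronic"] * 0.2
               + (1 - moods["relaxed"]) * 0.1
               + (1 - moods["acoustic"]) * 0.1)
    return valence, arousal
```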

Docker Configuration

Dockerfile

FROM ubuntu:20.04

# The base image ships without Python tooling; install pip and curl first
RUN apt-get update && \
    apt-get install -y --no-install-recommends python3-pip curl ca-certificates && \
    rm -rf /var/lib/apt/lists/*

# Install essentia-tensorflow (includes TensorFlow + MusiCNN support)
RUN pip3 install --no-cache-dir essentia-tensorflow

# Download the MusiCNN base model
RUN mkdir -p /app/models && \
    curl -L -o /app/models/msd-musicnn-1.pb \
    "https://essentia.upf.edu/models/autotagging/msd/msd-musicnn-1.pb"

# Classification heads
RUN curl -L -o /app/models/mood_happy-msd-musicnn-1.pb \
    "https://essentia.upf.edu/models/classification-heads/mood_happy/mood_happy-msd-musicnn-1.pb"
# ... (other models)
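With nine heads to fetch, a loop keeps the Dockerfile short; a sketch of an equivalent fragment, assuming the model-zoo URL pattern shown above holds for every head:

```dockerfile
# Fetch all nine classification heads in a single layer
RUN for m in mood_happy mood_sad mood_relaxed mood_aggressive \
             mood_party mood_acoustic mood_electronic \
             danceability voice_instrumental; do \
      curl -fL -o "/app/models/${m}-msd-musicnn-1.pb" \
        "https://essentia.upf.edu/models/classification-heads/${m}/${m}-msd-musicnn-1.pb" \
      || exit 1; \
    done
```

The `-f` flag makes curl fail (and the `exit 1` abort the build) on an HTTP error instead of silently saving an error page as a model file.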

Requirements

  • Ubuntu 20.04 (for Python 3.8 compatibility)
  • essentia-tensorflow pip package
  • ~10MB for all models combined

Usage in Code

import numpy as np
import essentia.standard as es
from essentia.standard import TensorflowPredictMusiCNN, TensorflowPredict2D

# Load base embedding model
musicnn = TensorflowPredictMusiCNN(
    graphFilename='/app/models/msd-musicnn-1.pb',
    output="model/dense/BiasAdd"  # Embedding output layer
)

# Load classification head
mood_happy = TensorflowPredict2D(
    graphFilename='/app/models/mood_happy-msd-musicnn-1.pb',
    output="model/Softmax"
)

# Process audio
audio = es.MonoLoader(filename=path, sampleRate=16000)()
embeddings = musicnn(audio)  # Shape: [frames, 200]
predictions = mood_happy(embeddings)  # Shape: [frames, 2]
happy_score = float(np.mean(predictions[:, 1]))  # Average over frames
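The same pattern extends to all nine heads; a sketch of the aggregation step, with the head objects generalized to plain callables (in production they would be the TensorflowPredict2D instances shown above):

```python
import numpy as np

def classify_all(embeddings, heads):
    """Run each classification head over shared MusiCNN embeddings.

    `heads` maps an output field name to a callable returning
    [frames, 2] softmax rows; column 1 is the positive class,
    averaged over frames as in the snippet above.
    """
    return {name: float(np.mean(head(embeddings)[:, 1]))
            for name, head in heads.items()}
```

Computing the embeddings once and reusing them for every head is the main saving: the base MusiCNN pass dominates the runtime, while each 2D head is cheap.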

Output Fields

Enhanced mode produces these additional fields:

| Field | Type | Range | Description |
|---|---|---|---|
| moodHappy | float | 0-1 | ML probability of happy mood |
| moodSad | float | 0-1 | ML probability of sad mood |
| moodRelaxed | float | 0-1 | ML probability of relaxed mood |
| moodAggressive | float | 0-1 | ML probability of aggressive mood |
| moodParty | float | 0-1 | ML probability of party mood |
| moodAcoustic | float | 0-1 | ML probability of acoustic sound |
| moodElectronic | float | 0-1 | ML probability of electronic sound |
| danceabilityMl | float | 0-1 | ML danceability score |
| valence | float | 0-1 | Derived emotional positivity |
| arousal | float | 0-1 | Derived energy level |
| acousticness | float | 0-1 | From moodAcoustic |
| instrumentalness | float | 0-1 | ML voice/instrumental detection |
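For reference, an Enhanced-mode result might look like this (values invented for illustration; valence and arousal roughly follow the derivation formulas above):

```python
# Illustrative Enhanced-mode output; every field is a 0-1 float.
result = {
    "moodHappy": 0.81, "moodSad": 0.07, "moodRelaxed": 0.22,
    "moodAggressive": 0.05, "moodParty": 0.64, "moodAcoustic": 0.12,
    "moodElectronic": 0.71, "danceabilityMl": 0.77,
    "valence": 0.78, "arousal": 0.49,
    "acousticness": 0.12, "instrumentalness": 0.09,
}
```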

Comparison: Standard vs Enhanced

| Feature | Standard Mode | Enhanced Mode |
|---|---|---|
| Mood Detection | Heuristic (key/BPM/energy) | ML (MusiCNN) |
| Accuracy | Approximate | Research-grade |
| Speed | Fast (~100ms) | Moderate (~500ms) |
| Dependencies | Essentia core | Essentia + TensorFlow |
| Model Size | 0 | ~10MB |
| Python Version | Any | 3.7-3.9 (for pip) |

Fallback Behavior

If Enhanced mode fails to initialize (missing models, TensorFlow errors), the analyzer automatically falls back to Standard mode:

if self.enhanced_mode and self.musicnn_model:
    ml_features = self._extract_ml_features(audio_16k)
    result.update(ml_features)
else:
    self._apply_standard_estimates(result, scale, bpm)
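A sketch of the initialization guard behind this fallback; the attribute names match the snippet above, but the constructor signature and `_load_models` helper are illustrative assumptions:

```python
import os

class Analyzer:
    def __init__(self, model_dir="/app/models"):
        self.enhanced_mode = False
        self.musicnn_model = None
        try:
            self.musicnn_model = self._load_models(model_dir)
            self.enhanced_mode = True
        except Exception as exc:  # missing models, TensorFlow errors
            print(f"Enhanced mode unavailable, falling back to Standard: {exc}")

    def _load_models(self, model_dir):
        # The real analyzer would construct TensorflowPredictMusiCNN and
        # the TensorflowPredict2D heads here; this stub only verifies
        # that the base model file is present.
        path = os.path.join(model_dir, "msd-musicnn-1.pb")
        if not os.path.exists(path):
            raise FileNotFoundError(path)
        return path
```

Failing closed into Standard mode at construction time means the per-track analysis path never has to re-check model availability.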

References