Tutorial
Introduction
With edgel3, you can compute audio embeddings from smaller versions of L3 models, which can be useful for resource-constrained devices. The supported audio formats are those supported by the pysoundfile library, which is used for loading the audio (e.g. WAV, OGG, FLAC).
Using the Library
edgel3 supports two values of model_type:
sparse: sparse L3 audio
sea: SONYC-UST specialized L3 audio
The default audio model is a 95.45% pruned and fine-tuned sparse L3 audio model. You can compute audio embeddings with the default model as follows:
import edgel3
import soundfile as sf
audio, sr = sf.read('/path/to/file.wav')
emb, ts = edgel3.get_embedding(audio, sr)
get_embedding returns two objects. The first object, emb, is a T-by-D numpy array, where T is the number of analysis frames used to compute embeddings, and D is the dimensionality of the embedding. The second object, ts, is a length-T numpy array containing the timestamps corresponding to each embedding (to the center of the analysis window, by default).
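As a rough illustration of how the two returned arrays relate, the sketch below uses synthetic stand-ins for emb and ts (no model is loaded). The sizes T and D, the 0.1 second spacing, and the helper frame_at are illustrative assumptions and not part of edgel3.

```python
import numpy as np

# Synthetic stand-ins for the arrays returned by get_embedding.
# T and D are arbitrary illustrative sizes, not values taken from edgel3.
T, D = 20, 128
rng = np.random.default_rng(0)
emb = rng.standard_normal((T, D))   # one D-dimensional embedding per frame
ts = np.arange(T) * 0.1             # one timestamp (in seconds) per frame

# Row i of emb belongs to timestamp ts[i], so the first axes must agree.
assert emb.shape[0] == ts.shape[0]

# A simple clip-level summary: average the frame embeddings over time.
clip_embedding = emb.mean(axis=0)


def frame_at(timestamps, t):
    """Hypothetical helper: index of the frame whose timestamp is closest to t."""
    return int(np.argmin(np.abs(timestamps - t)))


idx = frame_at(ts, 1.0)
print(clip_embedding.shape, idx)
```

Averaging over the time axis is only one possible way to obtain a fixed-size vector per clip; whether it suits a given downstream task is a separate design choice.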
These defaults for sparse models can be changed via the following optional parameters:
sparsity: 53.5, 63.5, 72.3, 87.0, or 95.45 (default)
retrain_type: “kd”, “ft” (default)
For example, to get embeddings from the 87.0% sparse audio model that has been trained with the knowledge-distillation method, you can use:
import edgel3
import soundfile as sf
audio, sr = sf.read('/path/to/file.wav')
emb, ts = edgel3.get_embedding(audio, sr, model_type='sparse', retrain_type='kd', sparsity=87.0)
All sea models have a reduced input representation. Moreover, models with an embedding dimension < 512 also have a reduced architecture. The default embedding dimension for sea models is 128, and it can be changed with the emb_dim parameter:
emb_dim: 512, 256, 128 (default), 64
import edgel3
import soundfile as sf
audio, sr = sf.read('/path/to/file.wav')
emb, ts = edgel3.get_embedding(audio, sr, model_type='sea', emb_dim=256)
By default, edgel3 will pad the beginning of the input audio signal by 0.5 seconds (half of the window size) so that the center of the first window corresponds to the beginning of the signal (“zero centered”), and the returned timestamps correspond to the center of each window. You can disable this centering like this:
emb, ts = edgel3.get_embedding(audio, sr, center=False)
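The effect of centering on the input can be pictured with a small numpy sketch. It assumes a mono signal, a 1 second window (implied by "half of the window size" being 0.5 seconds), and zero-valued padding; the function pad_for_centering is hypothetical and does not reproduce how edgel3 pads internally.

```python
import numpy as np


def pad_for_centering(audio, sr, window_s=1.0):
    """Hypothetical sketch: prepend half a window of zeros to a mono signal.

    Zero-valued padding is an assumption made for illustration only.
    """
    pad = int(sr * window_s / 2)
    return np.pad(audio, (pad, 0), mode="constant")


sr = 48000                 # arbitrary sample rate for the example
audio = np.ones(sr)        # one second of placeholder audio
padded = pad_for_centering(audio, sr)

# The first window now spans [-0.5 s, 0.5 s] of the original signal,
# so its center lines up with the start of the recording.
print(len(audio), len(padded))
```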
The hop size used to extract the embedding is 0.1 seconds by default (i.e. an embedding frame rate of 10 Hz). In the following example we change the hop size from 0.1 (10 frames per second) to 0.5 (2 frames per second):
emb, ts = edgel3.get_embedding(audio, sr, hop_size=0.5)
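The relationship between hop size and frame rate is simple arithmetic, sketched below. The function approx_timestamps is a hypothetical helper that spaces frames hop_size seconds apart; the exact number of frames edgel3 returns also depends on padding and centering, which this sketch ignores.

```python
import numpy as np


def approx_timestamps(duration_s, hop_size=0.1):
    """Hypothetical helper: evenly spaced frame times for a given hop size.

    This only illustrates the hop-size arithmetic; it is not edgel3's
    framing logic and may differ from it by a few frames.
    """
    n_frames = int(round(duration_s / hop_size))
    return np.arange(n_frames) * hop_size


ts_default = approx_timestamps(2.0)               # hop of 0.1 s, 10 frames per second
ts_coarse = approx_timestamps(2.0, hop_size=0.5)  # hop of 0.5 s, 2 frames per second
print(len(ts_default), len(ts_coarse))
```

A larger hop size produces fewer embeddings per second of audio, which reduces compute and storage at the cost of temporal resolution.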
Finally, you can silence the Keras printout during inference (verbosity) by changing it from 1 (default) to 0:
emb, ts = edgel3.get_embedding(audio, sr, verbose=0)
By default, the model file is loaded from disk every time get_embedding is called. To avoid unnecessary I/O when processing multiple files with the same model, you can load it manually and pass it to the function via the model parameter:
model = edgel3.models.load_embedding_model(model_type='sparse', retrain_type='ft', sparsity=53.5)
emb1, ts1 = edgel3.get_embedding(audio1, sr1, model=model)
emb2, ts2 = edgel3.get_embedding(audio2, sr2, model=model)
Since the model is provided, the keyword argument model_type and all parameters associated with sea and sparse will be ignored.
To compute embeddings for an audio file from a given model and save them to disk, you can use process_file:
import edgel3
audio_filepath = '/path/to/file.wav'
# Saves the embedding output to '/path/to/file.npz'
edgel3.process_file(audio_filepath)
# Saves the embedding output to '/path/to/file_suffix.npz'
edgel3.process_file(audio_filepath, suffix='suffix')
# Saves the embedding output to '/different/dir/file_suffix.npz'
edgel3.process_file(audio_filepath, output_dir='/different/dir', suffix='suffix')
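The output naming shown in the comments above can be summarized in a few lines. The function expected_output_path is a hypothetical helper written for illustration, not part of edgel3; it only mirrors the three cases listed and assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def expected_output_path(audio_path, output_dir=None, suffix=None):
    """Hypothetical helper mirroring the naming pattern in the examples above."""
    src = PurePosixPath(audio_path)
    # Append "_<suffix>" to the file name only when a suffix is given.
    stem = src.stem if suffix is None else f"{src.stem}_{suffix}"
    # Default to the directory of the input file.
    out_dir = src.parent if output_dir is None else PurePosixPath(output_dir)
    return str(out_dir / f"{stem}.npz")


print(expected_output_path('/path/to/file.wav'))
print(expected_output_path('/path/to/file.wav', suffix='suffix'))
print(expected_output_path('/path/to/file.wav', output_dir='/different/dir', suffix='suffix'))
```

A helper like this can be handy for checking whether an output already exists before reprocessing a file.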
The embeddings can be loaded from disk using numpy:
import numpy as np
data = np.load('/path/to/file.npz')
emb, ts = data['embedding'], data['timestamps']
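To try the loading step without running a model, the sketch below writes a synthetic .npz file with the two key names used above and reads it back. The array contents and sizes are placeholders, and the file is assumed to be a standard numpy archive containing at least these two keys.

```python
import os
import tempfile

import numpy as np

# Synthetic placeholders standing in for real embeddings and timestamps.
emb_in = np.zeros((5, 128))
ts_in = np.arange(5) * 0.1

with tempfile.TemporaryDirectory() as tmp_dir:
    npz_path = os.path.join(tmp_dir, "file.npz")
    # Write an archive with the key names used in the tutorial.
    np.savez(npz_path, embedding=emb_in, timestamps=ts_in)

    # Read the arrays while the archive is still open.
    with np.load(npz_path) as data:
        keys = sorted(data.files)
        emb, ts = data["embedding"], data["timestamps"]

print(keys, emb.shape, ts.shape)
```

Inspecting data.files first is a cheap way to confirm which arrays an archive actually contains before indexing into it.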
As with get_embedding, you can load the model manually and pass it to process_file to avoid loading the model multiple times:
import edgel3
model = edgel3.models.load_embedding_model(model_type='sparse', retrain_type='ft', sparsity=53.5)
audio_filepath = '/path/to/file.wav'
# Saves the embedding output to '/path/to/file.npz'
edgel3.process_file(audio_filepath, model=model)
# Saves the embedding output to '/path/to/file_suffix.npz'
edgel3.process_file(audio_filepath, model=model, suffix='suffix')
# Saves the embedding output to '/different/dir/file_suffix.npz'
edgel3.process_file(audio_filepath, model=model, output_dir='/different/dir', suffix='suffix')
Using the Command Line Interface (CLI)
To compute embeddings for a single file via the command line, run:
$ edgel3 /path/to/file.wav
This will create an output file at /path/to/file.npz.
You can change the output directory as follows:
$ edgel3 /path/to/file.wav --output /different/dir
This will create an output file at /different/dir/file.npz.
You can also provide multiple input files:
$ edgel3 /path/to/file1.wav /path/to/file2.wav /path/to/file3.wav --output /different/dir
which will create the output files /different/dir/file1.npz, /different/dir/file2.npz, and /different/dir/file3.npz.
You can also provide one (or more) directories to process:
$ edgel3 /path/to/audio/dir
This will process all supported audio files in the directory, though it will not recursively traverse the directory (i.e. audio files in subfolders will not be processed).
You can append a suffix to the output file as follows:
$ edgel3 /path/to/file.wav --suffix somesuffix
which will create the output file /path/to/file_somesuffix.npz.
To get embeddings from a sea model, the model type and embedding dimension can be provided, for example:
$ edgel3 /path/to/file.wav --model-type sea --emb-dim 256
To get embeddings from a sparse model, the sparsity and retrain type can be provided, for example:
$ edgel3 /path/to/file.wav --model-type sparse --model-sparsity 53.5 --retrain-type kd
By default, edgel3 will pad the beginning of the input audio signal by 0.5 seconds (half of the window size) so that the center of the first window corresponds to the beginning of the signal, and the timestamps correspond to the center of each window. You can disable this centering as follows:
$ edgel3 /path/to/file.wav --no-centering
In the following example we change the hop size from 0.1 (10 frames per second) to 0.5 (2 frames per second):
$ edgel3 /path/to/file.wav --hop-size 0.5
Finally, you can suppress non-error printouts by running:
$ edgel3 /path/to/file.wav --quiet
A full command for a sparse model may look like:
$ edgel3 /path/to/file.wav --output /different/dir --suffix somesuffix --model-type sparse --model-sparsity 53.5 --retrain-type kd --no-centering --hop-size 0.5 --quiet
A full command for a sea model may look like:
$ edgel3 /path/to/file.wav --output /different/dir --suffix somesuffix --model-type sea --emb-dim 64 --no-centering --hop-size 0.5 --quiet
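When the CLI is driven from a script, it can help to assemble the argument list programmatically. The sketch below builds such a list using only the flags shown in this section; build_edgel3_command and its parameter names are hypothetical and not part of edgel3, and the sketch only constructs the command without running it.

```python
import shlex


def build_edgel3_command(inputs, output=None, suffix=None, model_type=None,
                         sparsity=None, retrain_type=None, emb_dim=None,
                         centering=True, hop_size=None, quiet=False):
    """Hypothetical helper: assemble an edgel3 CLI call as an argv list."""
    cmd = ["edgel3", *inputs]
    if output is not None:
        cmd += ["--output", output]
    if suffix is not None:
        cmd += ["--suffix", suffix]
    if model_type is not None:
        cmd += ["--model-type", model_type]
    if sparsity is not None:
        cmd += ["--model-sparsity", str(sparsity)]
    if retrain_type is not None:
        cmd += ["--retrain-type", retrain_type]
    if emb_dim is not None:
        cmd += ["--emb-dim", str(emb_dim)]
    if not centering:
        cmd.append("--no-centering")
    if hop_size is not None:
        cmd += ["--hop-size", str(hop_size)]
    if quiet:
        cmd.append("--quiet")
    return cmd


argv = build_edgel3_command(
    ["/path/to/file.wav"], output="/different/dir", suffix="somesuffix",
    model_type="sparse", sparsity=53.5, retrain_type="kd",
    centering=False, hop_size=0.5, quiet=True,
)
command_line = shlex.join(argv)  # shell-quoted string, for logging or display
print(command_line)
```

The resulting list could then be passed to subprocess.run(argv, check=True) on a machine where the edgel3 command is installed.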