
Audio Embedding Pipelines

Audio is any sound within the range of human hearing. Audio embedding is the process of converting audio files (mp3, wav, etc.) into vector representations. Here, we list some of our built-in pipelines for generating audio embeddings.


Audio tasks have improved considerably with the use of 1-dimensional convolutional neural networks. As with the CNNs used for image embedding, most audio embedding models include some form of preprocessing, such as cropping the input to a fixed length and downsampling it to a lower sample rate. Towhee maintains the following audio embedding pipelines:
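The cropping and downsampling mentioned above can be sketched in a few lines. This is a minimal illustration that uses only the Python standard library; it is not the preprocessing code used by any Towhee pipeline. The helper names `crop_or_pad` and `downsample`, the 16 kHz sample rate, and the factor of 2 are all assumptions made for the example. Block averaging stands in for the proper anti-aliasing filter that a real resampler would apply.

```python
import math


def crop_or_pad(samples, sample_rate, duration_s):
    """Return exactly `duration_s` seconds of audio.

    Longer clips are cropped from the start; shorter clips are
    zero-padded at the end. (Hypothetical helper for illustration.)
    """
    target_len = int(round(sample_rate * duration_s))
    clip = list(samples[:target_len])
    clip.extend([0.0] * (target_len - len(clip)))
    return clip


def downsample(samples, factor):
    """Reduce the sample rate by an integer `factor`.

    Each block of `factor` samples is averaged into one output sample.
    Averaging is only a crude low-pass filter; trailing samples that do
    not fill a whole block are dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


# Two seconds of a 440 Hz sine tone at an assumed 16 kHz sample rate.
sample_rate = 16000
tone = [
    math.sin(2 * math.pi * 440 * n / sample_rate)
    for n in range(sample_rate * 2)
]

clip = crop_or_pad(tone, sample_rate, duration_s=1.0)  # keep 1 second
clip_8k = downsample(clip, factor=2)                   # 16 kHz -> 8 kHz
```

Fixing the clip length and sample rate up front means every input reaches the model with the same shape, which is why most embedding models apply steps like these before the network itself.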


This pipeline contains a pre-trained model based on VGGish. VGGish is a supervised model trained using AudioSet, a large-scale audio classification dataset.


This pipeline contains a pre-trained model based on CLMR (Contrastive Learning of Musical Representations). CLMR is a self-supervised, encoder-based model that works well for music fingerprinting. Its performance on generic audio clips is untested.
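To illustrate how embeddings might be used for fingerprinting, the sketch below matches a query embedding against a small catalog by cosine similarity. It is an assumption-laden example, not part of any Towhee pipeline: the functions `cosine_similarity` and `best_match`, the track names, and the 3-dimensional vectors are all made up for illustration, and real embeddings have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def best_match(query, catalog):
    """Return (name, score) for the catalog entry closest to `query`."""
    scored = (
        (name, cosine_similarity(query, embedding))
        for name, embedding in catalog.items()
    )
    return max(scored, key=lambda pair: pair[1])


# Made-up embeddings standing in for the output of an embedding model.
catalog = {
    "track_a": [0.9, 0.1, 0.0],
    "track_b": [0.0, 0.2, 0.95],
}
query = [0.85, 0.15, 0.05]  # e.g. a noisy excerpt of track_a

name, score = best_match(query, catalog)
```

A linear scan like this is fine for a handful of tracks; larger catalogs would typically use a vector index instead.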