Fluent Speech Commands: A dataset for spoken language understanding research

In recent years, with the advent of deep neural networks, the accuracy of speech recognition models has improved notably. This has made it possible to build speech-to-text systems that transcribe speech accurately even in difficult scenarios such as noisy environments, spontaneous speech, or high variability (speaking rate, accents, etc.). However, spoken language understanding (SLU) is still an open problem. Modern systems are far from being able to correctly interpret the meaning of the words uttered by a user unless the domain is highly constrained. That means that a user’s intents can be expressed only in a very limited number of ways or, in other words, using a simplified version of the language.

One of the most straightforward applications of SLU is the development of vocal interfaces for controlling different types of devices. Cellphones, smart homes, and intelligent cars are just some examples. Although interfaces of this kind are already available, they are usually constrained: users must say the exact words or phrases on which the system has been trained in order to guarantee high recognition accuracy. This is often frustrating for users, since they have to memorize the commands to use the system properly. To overcome this problem, systems should support natural language interaction and be able to deal with several variations of each intent or command. The user can then employ multiple wordings or paraphrases to interact with the interface, which greatly facilitates the interaction.

Releasing the Fluent Speech Commands dataset

At Fluent.ai, our primary research is focused on end-to-end SLU, i.e., directly extracting the intent from speech without converting it to text first. This is somewhat similar to how humans do speech recognition. Such SLU models have caught the attention of others in the research community in recent years. However, few SLU datasets are readily available to the research community: most of the existing ones are either closed source or too small. The lack of a good open-source dataset for SLU makes it impossible for most people to perform high-quality, reproducible research on this topic. To solve this problem, we created a new SLU dataset, the “Fluent Speech Commands” dataset. Specifically, Fluent Speech Commands can be used to train and test a system that recognizes a set of spoken commands, expressed in a variety of wordings, for interacting with a typical voice assistant in a smart-home scenario.

The Fluent Speech Commands dataset contains 30,043 utterances from 97 speakers. It is recorded as 16 kHz single-channel .wav files, each containing a single utterance used for controlling smart-home appliances or a virtual assistant, for example, “put on the music” or “turn up the heat in the kitchen”. Each audio file is labeled with three slots: action, object, and location. Each slot takes one of several values: for instance, the “location” slot can take the values “none”, “kitchen”, “bedroom”, or “washroom”. We refer to the combination of slot values as the intent of the utterance. For each intent, there are multiple possible wordings: for example, the intent {action: “activate”, object: “lights”, location: “none”} can be expressed as “turn on the lights”, “switch the lights on”, “lights on”, etc. The dataset has a total of 248 phrasings mapping to 31 unique intents. Demographic information about the anonymized speakers (age range, gender, speaking ability, etc.) is included with the dataset. The utterances are randomly divided into train, valid, and test splits in such a way that no speaker appears in more than one split. Each split contains all possible wordings for each intent, though our code has an option to restrict each split to certain wordings only, to test the model’s ability to recognize wordings not seen during training. The dataset has a .csv file for each split that lists the speaker ID, file path, transcription, and slots for all the .wav files in that split. The splits are tabulated below:

Split              Train     Valid    Test
# of speakers      77        10       10
# of utterances    23,132    3,118    3,793
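
The per-split .csv files described above lend themselves to a couple of quick sanity checks. The following is a minimal sketch, not the released code: it groups wordings by intent (the combination of action, object, and location) and checks that two splits share no speakers. The column names (speakerId, path, transcription, action, object, location), the file paths, and the speaker IDs are assumptions made for illustration, and the rows are made-up samples rather than real dataset entries.

```python
import csv
import io
from collections import defaultdict

# Hypothetical rows in the shape described above. Column names, paths, and
# speaker IDs are assumptions for illustration, not taken from the dataset.
TRAIN_CSV = """speakerId,path,transcription,action,object,location
spk01,wavs/spk01/a.wav,Turn on the lights,activate,lights,none
spk01,wavs/spk01/b.wav,Switch the lights on,activate,lights,none
spk02,wavs/spk02/c.wav,Lights on,activate,lights,none
spk02,wavs/spk02/d.wav,Turn up the heat in the kitchen,increase,heat,kitchen
"""

TEST_CSV = """speakerId,path,transcription,action,object,location
spk03,wavs/spk03/e.wav,Turn on the lights,activate,lights,none
"""


def read_rows(csv_text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def wordings_by_intent(rows):
    """Map each (action, object, location) intent to the set of its wordings."""
    intents = defaultdict(set)
    for row in rows:
        intent = (row["action"], row["object"], row["location"])
        intents[intent].add(row["transcription"].lower())
    return dict(intents)


def shared_speakers(rows_a, rows_b):
    """Return the speaker IDs that appear in both splits (should be empty)."""
    speakers_a = {row["speakerId"] for row in rows_a}
    speakers_b = {row["speakerId"] for row in rows_b}
    return speakers_a & speakers_b


train_rows = read_rows(TRAIN_CSV)
test_rows = read_rows(TEST_CSV)

intents = wordings_by_intent(train_rows)
for intent, wordings in sorted(intents.items()):
    print(intent, sorted(wordings))

print("speakers shared between splits:", shared_speakers(train_rows, test_rows))
```

To run the same checks on the real data, replace the in-memory strings with the actual split files and adjust the column names to match their headers.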

We are releasing this dataset for academic research only. It is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license. We hope that the research community finds this dataset useful.

License: This work is released strictly for academic research only. The dataset, in whole or in part, is not authorized to be used for any commercial purpose, including training, testing, benchmarking, or developing a product. The full license is available here.

