Controlling Robots through Voice Commands
ICT & Artificial Intelligence
Client company: Safety & Security Campus
Kevin Geertjens
Timofey Popov
Momchil Valkov
Matthew Da Silva
Project description
This project was the first in the SSC's research into controlling robots using speech commands. Currently, operators have to control robots manually, meaning that they don't have their hands free for other work. Speech commands could automate this control, allowing operators to perform other tasks.
Our task in this first project was to research whether the use of speech commands is feasible, and how it could best be implemented. The goal was to set up a prototype AI model that transcribes Dutch speech into one of several pre-defined commands.
Context
The Safety & Security Campus (SSC) is a collaboration between several government organisations, most notably the Ministry of Defense, Police, Firefighters, and Healthcare. They focus on innovation relevant to these sectors, and often collaborate with research and/or educational organisations such as Fontys.
The use of robots is becoming more popular among these organisations, especially in the military. These robots are currently controlled manually by operators, who therefore cannot perform other tasks while controlling them. This is why the SSC has set up a project to research automating the control of these robots; our project formed the first part of this research.
Results
Through our research we've worked out a rough process for how such a speech control system would function, starting from taking audio input from a microphone all the way to a robot performing actions. This solidified the SSC's initially vague idea for a speech control system into a tangible design that could be worked towards.
With this rough system design in place, we worked towards developing a barebones Proof-of-Concept covering the first part of the system: transforming Dutch speech into one of several pre-defined commands. This “Speech-to-Command” system would be connected to a 3D simulation representing a robot, so that the robot character could be controlled with speech commands. We opted for a 3D simulation instead of an actual robot mainly due to time constraints; following project groups will work on connecting the Speech-to-Command system to an actual robot.
During our research, we found a suitable Speech-to-Text AI model capable of transcribing Dutch speech into text. However, this model wasn't perfect, and its transcriptions often contained mistakes. Most of the effort on the Speech-to-Command system went towards mitigating these mistakes. We found that background noise in the audio often resulted in spelling mistakes or stray characters in the transcriptions, which is why we implemented noise reduction on the audio, somewhat reducing the impact of background noise.
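As a rough illustration of this step, the sketch below loads a recording, applies noise reduction, and transcribes the cleaned audio as Dutch speech. It is a minimal sketch under assumptions: the noisereduce package stands in for our noise reduction, and Whisper stands in for the Dutch-capable Speech-to-Text model; the model and settings used in the project itself may differ.

import librosa
import noisereduce as nr
import soundfile as sf
import whisper  # illustrative stand-in for the Dutch Speech-to-Text model (assumption)

def transcribe_with_noise_reduction(path: str) -> str:
    # Load the microphone recording as 16 kHz mono audio
    audio, sample_rate = librosa.load(path, sr=16000, mono=True)
    # Spectral-gating noise reduction to suppress background noise
    cleaned = nr.reduce_noise(y=audio, sr=sample_rate)
    sf.write("cleaned.wav", cleaned, sample_rate)
    # Transcribe the cleaned audio, forcing Dutch as the language
    model = whisper.load_model("small")
    result = model.transcribe("cleaned.wav", language="nl")
    return result["text"].strip()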
This didn't fix the problem entirely, however. So, to ensure that the output of the Speech-to-Command system is always one of the pre-defined commands, we implemented a mapping between the transcriptions and the defined commands. We calculate the Levenshtein distance between the transcription and each command, and output the command that is most similar to the transcription. We also introduced a similarity threshold: if the transcription isn't similar enough to any of the commands, no output is given. This ensures that a command is only output when the transcription is close enough to it.
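A minimal sketch of this mapping is shown below, assuming the python-Levenshtein package; the command list and threshold value are placeholders rather than the project's actual configuration.

import Levenshtein  # python-Levenshtein package (assumption)

COMMANDS = ["vooruit", "achteruit", "links", "rechts", "stop"]  # placeholder commands
SIMILARITY_THRESHOLD = 0.7  # placeholder value

def similarity(a: str, b: str) -> float:
    # Turn the Levenshtein distance into a similarity score between 0 and 1
    distance = Levenshtein.distance(a, b)
    return 1.0 - distance / max(len(a), len(b), 1)

def map_to_command(transcription: str) -> str | None:
    text = transcription.lower().strip()
    # Pick the pre-defined command most similar to the transcription
    best_command = max(COMMANDS, key=lambda command: similarity(text, command))
    # Only output a command if the transcription is similar enough to it
    if similarity(text, best_command) >= SIMILARITY_THRESHOLD:
        return best_command
    return None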
This system worked well for transcribing a single speech command, but ideally we should be able to say multiple commands in quick succession. We tried to split the transcriptions of longer sentences using various NLP models, including transformers, but this proved difficult. Instead, the simpler solution was to automatically split the input audio on the short pauses between commands. Each audio chunk is then transcribed into a command using the system we'd already created.
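The splitting could look roughly like the sketch below, assuming pydub's silence utilities; the pause length and silence threshold are illustrative guesses, not the parameters we tuned for the project.

from pydub import AudioSegment
from pydub.silence import split_on_silence

def split_into_command_chunks(path: str) -> list[AudioSegment]:
    recording = AudioSegment.from_wav(path)
    # Split wherever a short pause (here: at least 400 ms of relative silence) occurs
    return split_on_silence(
        recording,
        min_silence_len=400,                 # milliseconds of silence that count as a pause
        silence_thresh=recording.dBFS - 16,  # anything 16 dB below the average loudness
        keep_silence=100,                    # keep a little padding around each chunk
    )

# Each chunk is then run through the single-command pipeline, for example:
# for index, chunk in enumerate(split_into_command_chunks("commands.wav")):
#     chunk.export(f"chunk_{index}.wav", format="wav")
#     command = map_to_command(transcribe_with_noise_reduction(f"chunk_{index}.wav"))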
Simultaneously, we set up the 3D simulation using the Unity game engine, containing a test environment and a character representing a robot. We set up a back-end for the Speech-to-Command system to run in, and connected it to the Unity environment through WebSockets. Because the Speech-to-Command system runs in a separate back-end, it can easily be connected to other simulations or an actual robot in the future. We then implemented basic behaviour for the character to perform based on the command it receives.
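A minimal sketch of the back-end side of this connection, assuming the Python websockets package, a local port, and a simple JSON message format (none of which are confirmed details of the project); Unity would connect as a WebSocket client and apply each received command to the robot character.

import asyncio
import json
import websockets

async def command_stream(websocket):
    # In the real system the commands come from the Speech-to-Command pipeline;
    # here a few hard-coded examples are pushed to the connected Unity client.
    for command in ["forward", "left", "stop"]:
        await websocket.send(json.dumps({"command": command}))
        await asyncio.sleep(1.0)

async def main():
    # Unity connects to ws://localhost:8765 and acts on each incoming command
    async with websockets.serve(command_stream, "localhost", 8765):
        await asyncio.Future()  # run until interrupted

if __name__ == "__main__":
    asyncio.run(main())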
About the project group
This group consists of two ICT & Software students, one ICT & Business student, and one ICT & Media Design student, all of whom completed the AI Core specialisation semester before attending the AI Advanced semester, during which this project took place.
Around 50% of our time during this semester was devoted to this group project. We worked with the Scrum methodology in two-week sprints, holding a sprint review meeting at the end of each sprint to update our client regularly.