Video Summary

Speech-To-Steve

Summary of the Project

The main idea of our project is to use Python to let the user control an AI agent through speech. We used natural language processing to convert audio into commands, and let the agent perform all of them in Minecraft. For example, saying “kill the cow, find a goat and then move backwards for 6 steps” into a microphone creates three commands: “kill cow”, “find goat”, and “move backwards(6)”, which automatically call the corresponding functions and let the agent execute them. To accomplish this, we first converted speech to text using SpeechRecognition, and then converted the text into commands using spaCy, a library for advanced NLP, so that the agent can carry them out through Malmo.

Approach

We plan to use supervised machine learning algorithms for natural language processing and text analysis, such as SVMs and neural networks, with the support of the SpeechRecognition library. We will also use the PyAudio library to capture microphone input as audio data that can be sent to a speech recognition API. Specifically, we plan to use the Google Cloud Speech API to transcribe the audio of user-given commands, which will then be further analyzed with the spaCy library. The agent will then execute the required tasks by calling the function associated with each command through the Malmo platform.

Part 1: Speech Recognition

We used PyAudio to record the user’s audio, and then transcribed the input with the Google Cloud Speech API through SpeechRecognition to convert the voice commands into text. We calibrate the recognizer’s energy threshold against ambient noise for a duration of 1 second so that it recognizes speech right after it starts listening. In this part, the user’s voice commands are converted into text and stored as a string.
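A minimal sketch of this step, assuming SpeechRecognition with PyAudio installed as the microphone backend (we show the free recognize_google endpoint here; the Google Cloud variant, recognize_google_cloud, is called the same way but requires credentials):

```python
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.Microphone() as source:
    # Calibrate the energy threshold against 1 second of ambient noise
    # so listening begins as soon as speech starts.
    recognizer.adjust_for_ambient_noise(source, duration=1)
    audio = recognizer.listen(source)

try:
    text = recognizer.recognize_google(audio)
    print("Recognized:", text)
except sr.UnknownValueError:
    print("Could not understand the audio")
except sr.RequestError as e:
    print("Speech API request failed:", e)
```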

Part 2: Information Extraction and Context Understanding

We used spaCy to create an NLP object that contains all the components and data needed to process text. Calling this object on the recognized string returns a processed Doc object that contains all of the information of the original text as tokens.
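A minimal sketch of this step, assuming the small English model en_core_web_sm (any spaCy English model works the same way):

```python
import spacy

# Load the pipeline once; calling nlp(...) returns a Doc whose tokens
# carry part-of-speech and dependency annotations.
nlp = spacy.load("en_core_web_sm")
doc = nlp("Go and find a sheep and cow, and then jump six times")

for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)
```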

To further process and understand our commands, we created a Command class. In its class functions, we use NeuralCoref to replace pronouns (e.g. “it”, “my”), then iterate over every token of the processed doc to extract verbs and their corresponding direct objects/numbers, and finally combine the verbs and objects into commands that can be passed to the Malmo functions we created. Specifically, verbs are extracted from the doc object by analyzing their part-of-speech and dependency tags, and are stored in a list. We also collect words of interest for every verb, such as its direct objects and numbers indicating how many times a command needs to be repeated. For example, if the text of the doc object is “Go and find a sheep and cow, and then jump six times”, then the verbs and their words of interest are “go”; “find” with “sheep” and “cow”; and “jump” with “six”.
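A hypothetical sketch of this extraction (the function name, the dependency labels matched, and the tree-climbing heuristic are illustrative, and the NeuralCoref pronoun-replacement step is omitted; our actual Command class differs in detail):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_commands(text):
    """Map each verb lemma to its words of interest (objects and numbers)."""
    doc = nlp(text)
    commands = {token.lemma_: [] for token in doc if token.pos_ == "VERB"}
    for token in doc:
        if token.dep_ in ("dobj", "nummod", "conj") and token.pos_ in ("NOUN", "NUM"):
            # Climb the dependency tree to the verb that governs this
            # object or number (e.g. "six" -> "times" -> "jump").
            head = token.head
            while head.pos_ != "VERB" and head.head is not head:
                head = head.head
            if head.pos_ == "VERB" and head.lemma_ in commands:
                commands[head.lemma_].append(token.lemma_)
    return commands

print(extract_commands("Go and find a sheep and cow, and then jump six times"))
# e.g. {'go': [], 'find': ['sheep', 'cow'], 'jump': ['six']}
```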

To combine this information in a form that can be processed by Malmo commands, we created a dict object with the extracted verbs as keys and their words of interest as values. In addition, we implemented a similarity check so that the agent treats similar commands as the same one. For example, the commands “go forward” and “walk forward” would call the same function, walk_forward().
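A hypothetical sketch of the similarity check using spaCy word vectors (this assumes a model that ships with vectors, such as en_core_web_md; the threshold and the verb-to-function mapping are illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_md")  # a model with word vectors

def walk_forward():
    print("agent walks forward")

# Known verbs mapped to their Malmo wrapper functions.
KNOWN_VERBS = {"walk": walk_forward}

def resolve_verb(verb, threshold=0.5):
    """Return the handler for the known verb most similar to `verb`."""
    token = nlp(verb)[0]
    best, best_score = None, threshold
    for known, handler in KNOWN_VERBS.items():
        score = token.similarity(nlp(known)[0])
        if score > best_score:
            best, best_score = handler, score
    return best

# With a suitable threshold, "go" resolves to the same handler as "walk".
handler = resolve_verb("go")
if handler:
    handler()
```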

Part 3: Command Execution in Malmo

We have implemented continuous movement by calling built-in commands in Malmo and adjusting the sleep time between commands accordingly. Commands implemented as of the status report include basic movement, such as jumping, walking (in any direction), and running.
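A minimal sketch of such a wrapper, assuming a running mission and an agent_host created with MalmoPython.AgentHost() (the step-to-seconds scaling is illustrative):

```python
import time

STEP_DURATION = 0.5  # illustrative seconds-per-step scaling

def walk_forward(agent_host, steps):
    # Malmo's continuous movement: "move 1" starts walking at full speed
    # and "move 0" stops; the sleep in between controls the distance.
    agent_host.sendCommand("move 1")
    time.sleep(STEP_DURATION * steps)
    agent_host.sendCommand("move 0")

def jump(agent_host, times):
    for _ in range(times):
        agent_host.sendCommand("jump 1")
        time.sleep(STEP_DURATION)
        agent_host.sendCommand("jump 0")
        time.sleep(STEP_DURATION)
```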

Evaluation

We will evaluate the success of our project based on the complexity of the commands we can implement accurately and how well the agent performs tasks. There are different tiers of difficulty for commands: “walk to the right” is much easier to implement than “find and mine a diamond.” We are aiming to implement commands that are pretty complex and interact with the environment (e.g. “place down a dirt block to the left,” “take a gold ingot from the chest”), with a moonshot case being extremely complex commands that need contextual understanding (e.g. “enter the third house on the right”).

Quantitative:

We intend to evaluate our success quantitatively by measuring the accuracy of our voice commands. In other words, we will calculate the proportion of successfully recognized voice commands out of the total number of voice commands given. We will also calculate the command completion rate (e.g., the agent actually moves north when told to go north, and it successfully recognizes objects in Minecraft).

Qualitative:

We intend to evaluate our success qualitatively by visually checking if the agent can actually perform commands. For example, we will check if the agent actually moves 5 blocks to the left if it is given the command “walk 5 blocks left.”

| Single Command | Multiple Commands | Synonym |
| --- | --- | --- |
| “walk to the left for 10 steps” | “walk 10 steps, then run 10 steps to the right, and then jump 5 times” | “hurdle 5 times”; “go forward for 10 blocks” |
| Voice Recognized | Voice Recognized | Voice Recognized |
| Command Executed Successfully | Command Executed Successfully | Command Executed Successfully |

Remaining Goals and Challenges