The main idea of our project is to use Python to let the user control the AI agent via speech. We used natural language processing to convert audio into commands, and let the agent perform all of them in Minecraft. For example, saying “kill the cow, find a goat and then move backwards for 6 steps” into a microphone would create three commands: “kill cow”, “find goat”, and “move backwards(6)”, which automatically call the corresponding functions and have the agent execute them. To accomplish this, we first converted speech to text using SpeechRecognition, and then converted the text into commands using spaCy, a library for advanced NLP, so that the agent can act on them using Malmo.
We plan to use supervised machine learning algorithms for natural language processing and text analysis, such as SVMs and neural networks, with the support of the SpeechRecognition library. We will also use the PyAudio library to parse microphone input into audio data that can be sent to a speech recognition API. Specifically, we are planning on using the Google Cloud Speech API to parse the audio data of user-given commands, which would then be further analyzed with the spaCy library. The agent would then execute the required tasks by calling the function associated with each command through the Malmo platform.
We used PyAudio to record the user’s audio, and then parsed the audio input with the Google Cloud Speech API through SpeechRecognition to convert the voice commands into text. We set the energy-threshold calibration duration to 1 second so that the recognizer starts recognizing speech right after it begins listening. In this step, the user’s voice commands are converted into text and stored as a string.
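As a rough sketch, the listening step looks like the following. We use SpeechRecognition’s default Google recognizer here for simplicity; the library also exposes `recognize_google_cloud()` for the Cloud Speech API, which additionally requires credentials.

```python
import speech_recognition as sr

recognizer = sr.Recognizer()

def listen_for_command():
    """Record one utterance from the microphone and return it as text."""
    # sr.Microphone streams audio through PyAudio under the hood.
    with sr.Microphone() as source:
        # Calibrate the energy threshold against ambient noise for 1 second
        # so recognition starts right after the user begins speaking.
        recognizer.adjust_for_ambient_noise(source, duration=1)
        audio = recognizer.listen(source)
    # recognize_google_cloud() is the Cloud Speech variant of this call.
    return recognizer.recognize_google(audio)
```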
We used spaCy to create an nlp object that contains all the components and data needed to process the text; calling the nlp object on a string returns a processed Doc object that contains all of the information of the original text as tokens.
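A minimal example, assuming spaCy’s small English model `en_core_web_sm`:

```python
import spacy

# Load a spaCy pipeline (tokenizer, tagger, dependency parser, ...).
nlp = spacy.load("en_core_web_sm")

# Calling the nlp object returns a Doc whose tokens carry the
# part-of-speech and dependency labels used in the next step.
doc = nlp("go and find a cow")
for token in doc:
    print(token.text, token.pos_, token.dep_)
```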
To further process and understand our commands, we created a Command class. In the class functions, we use NeuralCoref to replace pronouns (e.g. “it”, “my”), then iterate over every token of the processed Doc to extract verbs and their corresponding direct objects/numbers, and finally combine the verbs and objects into commands that can be passed to the Malmo functions we created. Specifically, verbs are extracted from the Doc object by analyzing their part-of-speech and dependency tags and are stored in a list. We also collect words of interest for every verb, such as its direct objects and numbers indicating how many times we need to repeat certain commands. For example, if the text of the Doc object is “Go and find a sheep and cow, and then jump six times”, then the verbs and their words of interest are “go”, “find” with “sheep” and “cow”, and “jump” with “six”.
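The sketch below illustrates this extraction step on the example sentence. The helper name `extract_commands` is ours for illustration, the real Command class also runs NeuralCoref first, and exact parse trees can vary between spaCy models.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_commands(text):
    """Map each verb to its words of interest (direct objects and numbers)."""
    doc = nlp(text)
    commands = {}
    for token in doc:
        if token.pos_ != "VERB":
            continue
        interests = []
        for child in token.children:
            if child.dep_ == "dobj":
                # Direct objects: "find a sheep" -> "sheep"
                interests.append(child.lemma_)
                # Conjoined objects: "a sheep and cow" -> also "cow"
                interests += [c.lemma_ for c in child.children if c.dep_ == "conj"]
            elif child.dep_ == "npadvmod":
                # "jump six times": the number hangs off "times" as nummod
                interests += [c.text for c in child.children if c.dep_ == "nummod"]
        commands[token.lemma_] = interests
    return commands

# extract_commands("Go and find a sheep and cow, and then jump six times")
# -> {"go": [], "find": ["sheep", "cow"], "jump": ["six"]}
```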
To combine this information in a way that can be processed by Malmo commands, we created a dict object with the extracted verbs as keys and their words of interest as values. In addition, we implemented a similarity check so that the agent can treat similar commands as the same one. For example, the commands “go forward” and “walk forward” would call the same function, walk_forward().
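A simplified sketch of the similarity check follows. The verb list and the 0.7 threshold are illustrative assumptions rather than our exact values, and spaCy similarity needs a model that ships with word vectors (e.g. `en_core_web_md`).

```python
import spacy

# Word-vector similarity requires a medium or large model;
# en_core_web_sm ships without real vectors.
nlp = spacy.load("en_core_web_md")

# Hypothetical list of verbs our Malmo wrappers already implement.
KNOWN_VERBS = ["walk", "run", "jump", "find", "kill"]

def resolve_verb(spoken_verb, threshold=0.7):
    """Return the known verb closest to the spoken one, or None.

    With this check, "go forward" and "walk forward" both end up
    calling walk_forward().
    """
    spoken = nlp(spoken_verb)[0]
    best, best_score = None, threshold
    for verb in KNOWN_VERBS:
        score = spoken.similarity(nlp(verb)[0])
        if score > best_score:
            best, best_score = verb, score
    return best
```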
We have implemented continuous movement by calling built-in commands in Malmo and adjusting the sleep time between commands accordingly. Commands implemented as of the status report include basic movement, such as jumping, walking (in any direction), running, etc.
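For instance, a walk command can be approximated by starting Malmo’s continuous move, sleeping, and stopping; the per-block timing below is a hand-tuned assumption rather than our exact values.

```python
import time

def walk_forward(agent_host, steps):
    """Walk forward roughly `steps` blocks using Malmo's continuous movement.

    agent_host is a MalmoPython.AgentHost connected to a running mission.
    """
    agent_host.sendCommand("move 1")   # start moving at full walking speed
    time.sleep(0.5 * steps)            # wait long enough to cover the distance
    agent_host.sendCommand("move 0")   # stop

def jump(agent_host, times):
    """Jump a given number of times, pausing between jumps."""
    for _ in range(times):
        agent_host.sendCommand("jump 1")
        time.sleep(0.5)
        agent_host.sendCommand("jump 0")
        time.sleep(0.5)
```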
We will evaluate the success of our project based on the complexity of the commands we can implement accurately and how well the agent performs tasks. There are different tiers of difficulty for commands: “walk to the right” is much easier to implement than “find and mine a diamond.” We are aiming to implement commands that are pretty complex and interact with the environment (e.g. “place down a dirt block to the left,” “take a gold ingot from the chest”), with a moonshot case being extremely complex commands that need contextual understanding (e.g. “enter the third house on the right”).
We intend to evaluate our success quantitatively by measuring the accuracy of our voice commands. In other words, we will calculate the proportion of successfully recognized voice commands to the total number of voice commands given. We will also calculate the command completion rate (e.g. whether the agent actually moves north when told to go north, or successfully recognizes objects in Minecraft).
We intend to evaluate our success qualitatively by visually checking if the agent can actually perform commands. For example, we will check if the agent actually moves 5 blocks to the left if it is given the command “walk 5 blocks left.”
| Single Command | Multiple Commands | Synonym |
|---|---|---|
| walk to the left for 10 steps | walk 10 steps, then run 10 steps to the right, and then jump 5 times | hurdle 5 times; go forward for 10 blocks |
| Voice Recognized | Voice Recognized | Voice Recognized |
| Command Executed Successfully | Command Executed Successfully | Command Executed Successfully |
**Information Extraction and Filtering**

Since this project is about voice recognition, the commands our users give may contain many words that are semantically meaningless to the content of the command. Take a basic statement as an example: “go” is useless in the sentence “go and find a cow”, so we need to filter out such words. Although we are able to filter many of these words and extract the most useful ones, we still need to improve the accuracy so that our agent can execute every command given with desirable performance. One heuristic we are considering is sketched below.
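This sketch (not our final filter) drops verbs that extracted no words of interest whenever a sibling verb did, using the dict produced by the extraction step above:

```python
def prune_filler_verbs(commands):
    """Drop verbs with no words of interest when another verb has some.

    For "go and find a cow", extraction gives {"go": [], "find": ["cow"]},
    so "go" is treated as filler and removed. Commands where every verb is
    bare (e.g. just "jump") are left unchanged.
    """
    if any(interests for interests in commands.values()):
        return {verb: words for verb, words in commands.items() if words}
    return commands
```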
**Understand More Complex Commands**

At this stage, our agent is able to recognize single, multiple, and synonymous commands. We have not yet finished implementing more advanced commands, such as “kill the cow,” so this part needs further testing.
**Similarity Check**

We are also encountering the problem that our agent may treat “run” and “walk” as the same command when the similarity check is enabled. It is hard to define similarity between words; for example, the statements “I like pizza” and “I like flowers” both talk about one’s preferences, but are dissimilar with regard to their categorization.