Humans are so involved with each other that they communicate for every small thing, from a grocery vendor to an airport ticket booth. Today, people prefer to communicate via texts and voice messages instead of regular phone calls or face-to-face conversations. The younger generation in particular prefers this medium because they use it in their daily lives and feel comfortable with it.
Communication is not limited to text and speech; it spans a wide range of modes, such as hand gestures, facial expressions, handwriting, eye gaze, pictures, and videos. Communication that combines several of these modes is what we usually call multimodal interaction.
Conversational AI can be made better with multimodal interactions, and it has the potential to change the business world in a very influential way. Many businesses have already turned to chatbots to serve daily requests from customers, and the chatbot is a fundamental entity of conversational AI. If we allow people to talk with a machine using any gesture or medium, we can truly reach the potential of an advanced AI.
Popescu claims in his research paper that, in conversational AI, users generally provide input to the agent in a single mode and receive output in the same mode. While it is usually a good idea to anticipate the user's needs and over-deliver, when you present so much content in a single mode that the user cannot grok any of it, the interaction becomes meaningless. So, to properly simulate human conversation, agents should have multimodal capabilities.
The modalities used are primarily visual, auditory, and haptic. The number and quality of these modalities, and the interaction between them, are crucial to the realism of the simulation and ultimately to its usefulness. Thus, the need for increased immersion and interaction motivates system designers to integrate additional modalities into virtual environment (VE) systems and to take advantage of cross-modal effects.
People naturally interact with the world multimodally, through both parallel and sequential use of multiple perceptual modalities. In the area of multimodal system design, user studies have therefore been conducted to explore how users combine modalities when interacting with real systems. Examples of modalities are typed command language, typed natural language, spoken natural language, 2D gestures with the mouse, and 2D gestures with a pen. During user interactions, “chunks” of information are transmitted across several modalities from the user to the computer and vice versa. Related chunks of data can be grouped into higher-level entities (i.e., commands). As for “multimodality,” we can propose the following definition: “the cooperation between several modalities in order to improve the interaction.”
Architecture, Design, and Modelling:
The simplest conversational AI, a chatbot, usually takes text as input and produces text as output, while maintaining the context of the conversation. A chatbot usually consists of the following components:
1. NLP Component: This component is responsible for fetching the correct response for a user input.
2. Knowledge Base and Data Store: The NLP layer communicates with the knowledge base to get responses for domain-specific messages, while the data store holds the interaction history and analytics, which help maintain the context of the conversation.
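These components can be sketched in code. The following is a minimal, illustrative sketch only; all class names, intents, and the keyword-matching "NLP" are assumptions made for this example, not any particular chatbot framework.

```python
class KnowledgeBase:
    """Maps recognized intents to domain-specific responses."""
    def __init__(self):
        self._responses = {
            "greeting": "Hello! How can I help you?",
            "hours": "We are open 9am-5pm, Monday to Friday.",
        }

    def lookup(self, intent):
        return self._responses.get(intent, "Sorry, I didn't understand that.")


class DataStore:
    """Keeps the interaction history so the bot can maintain context."""
    def __init__(self):
        self.history = []

    def record(self, user_msg, bot_msg):
        self.history.append((user_msg, bot_msg))


class NLPComponent:
    """Very crude intent detection via keyword matching (a stand-in for a real NLP layer)."""
    def detect_intent(self, text):
        text = text.lower()
        if any(word in text for word in ("hi", "hello")):
            return "greeting"
        if "open" in text or "hours" in text:
            return "hours"
        return "unknown"


class Chatbot:
    """Wires the NLP layer to the knowledge base and the data store."""
    def __init__(self):
        self.nlp = NLPComponent()
        self.kb = KnowledgeBase()
        self.store = DataStore()

    def respond(self, user_msg):
        intent = self.nlp.detect_intent(user_msg)
        reply = self.kb.lookup(intent)
        self.store.record(user_msg, reply)   # history enables conversational context
        return reply


bot = Chatbot()
print(bot.respond("Hello there"))        # → "Hello! How can I help you?"
print(bot.respond("When are you open?")) # → "We are open 9am-5pm, Monday to Friday."
```

A production system would replace the keyword matcher with a trained intent classifier, but the division of responsibility between the three components stays the same.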
A multimodal interaction system is mainly built from an interaction manager, a fusion component, and a fission component. A modality is any component that supports interaction.
Fusion aggregates commands coming from different modalities into one single event; it is responsible only for gathering the information. The major component in a multimodal interaction system is the interaction manager: it receives the fused events from the fusion component and passes appropriate responses to the fission component. Generating outputs (responses) over various modalities is done by fission, which splits a single event and provides output to the different modes.
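The fusion → interaction manager → fission pipeline can be sketched as three small functions. This is an illustrative sketch only; the function names and dictionary shapes are assumptions made for this example, not a standard multimodal API.

```python
def fusion(inputs):
    """Aggregate events arriving from several modalities into one combined event."""
    event = {"modalities": [i["modality"] for i in inputs]}
    for i in inputs:
        event.update(i.get("payload", {}))
    return event


def interaction_manager(event):
    """Decide on a single response for the fused event (stub decision logic)."""
    used = " and ".join(event["modalities"])
    return {"text": f"Received input over {used}."}


def fission(response, output_modes=("speech", "text")):
    """Split one response into an output per requested modality."""
    return {mode: response["text"] for mode in output_modes}


inputs = [
    {"modality": "touch", "payload": {"point": (10, 20)}},
    {"modality": "speech", "payload": {"utterance": "zoom in here"}},
]
outputs = fission(interaction_manager(fusion(inputs)))
print(outputs["text"])  # → "Received input over touch and speech."
```

Note the asymmetry: fusion only gathers, the interaction manager decides, and fission only distributes the decided response across output modes.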
Consider a small system where a user points at two locations on a map and asks for the shortest path between them. This uses two modalities, namely touch and speech.
Here both input requests are combined into one single event by fusion. This information is then passed to the interaction manager, which retrieves useful data and runs it through a pre-trained model to generate a proper response. The response then goes through the fission module, which delivers it over output modalities such as speech and text. This is not as simple as it looks: the interaction manager has a large and complex internal structure that helps it produce better responses each time it takes input and generates responses. This falls under the context model, which keeps track of the state of the conversation.
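The map example can be sketched end to end. This is a hedged toy sketch: the straight-line distance stands in for a real routing engine, and the intent check and function names are assumptions made for this example.

```python
import math

def fuse(touch_events, speech_text):
    """Combine the two pointed locations with the spoken request into one event."""
    assert len(touch_events) == 2, "expected exactly two pointed locations"
    intent = "shortest_path" if "shortest" in speech_text.lower() else "unknown"
    return {"intent": intent, "points": touch_events}


def handle(event):
    """Interaction manager: answer the fused query (straight-line distance as a stand-in)."""
    (x1, y1), (x2, y2) = event["points"]
    dist = math.hypot(x2 - x1, y2 - y1)
    return {"text": f"The shortest path is about {dist:.1f} km."}


def fission(response):
    """Deliver the same answer over both output modalities."""
    return {"text": response["text"], "speech": response["text"]}


event = fuse([(0.0, 0.0), (3.0, 4.0)],
             "Show me the shortest path between these two")
print(fission(handle(event))["text"])  # → "The shortest path is about 5.0 km."
```

A real system would pass the fused event to a route planner and a dialogue context model, but the flow (fuse the touch and speech inputs, handle the single event, fan the answer out to speech and text) is exactly the one described above.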