Vocabulary
Review our glossary of voice and conversational AI terms to become more familiar with the vocabulary used by the Open Voice Interoperability Initiative. Below you’ll find detailed definitions and helpful examples.
A
- Architectural Pattern – a model of an important feature that occurs in all implementations of a system.
- Artificial Intelligence (AI) – also known as machine intelligence. A field of computer science focused on designing intelligent computer systems that exhibit characteristics associated with human behavior. AI is an academic discipline with multiple sub-fields, such as Natural Language Processing, Neural Networks, Robotics, Speech Processing, and Machine Learning.
- Automatic Speech Recognition – also known as Speech-to-Text or computer speech recognition. An interdisciplinary subfield of computer science and computational linguistics that enables the recognition and translation of spoken utterances into text.
B
C
- Channel – an endpoint that enables conversation between Conversational Agents. Examples include smart devices, mobile phones, websites, and mobile apps. See Conversational Endpoint below.
- Component – an identifiable part of a voice assistant or agent. A component provides a particular function or group of related functions.
- Component Interoperability – the ability, within a voice assistant, to replace one component with another from a different vendor. For the purposes of the Open Voice Interoperability Initiative, interoperable components may be recognized within the voice industry as open or proprietary as long as interoperability capabilities are met.
- Context – see Conversational Context below.
- Conversation – a joint activity in which two or more agents (human or automated) use linguistic forms and non-verbal signals (e.g., gestures) to communicate and achieve an outcome that meets a shared goal.
- Conversation Event – signals a change in the conversation that may be acted upon. Such an event may mark the beginning or end of a Conversation Session, the completion of a Conversation Processor, the decoding of Conversation Information, a change to the state of a Conversation Endpoint, or a change to the status of a Conversation Stream. Any component with access to the system may generate a Conversation Event.
- Conversation Facilitator – coordinates communication between two or more Dialogue Systems and/or Processors during the course of one or more Sessions. This allows Dialogue Systems and their associated Processors to collaborate regardless of the technology being used.
- Conversation Information – the information exchanged between agents during a conversation. Examples of Conversation Information include semantic, lexical, syntactic, and prosodic features.
- Conversation Information Layer – an abstraction of one type of information in a Dialog System. A layer may be a specific type of acoustic, linguistic, non-linguistic, or paralinguistic feature. Examples of layers include Cepstral features, Phonemes, Intonation Boundaries, Words, Phrases, Turn Boundaries, Syllabic Stress, Discourse Move Type, and specific Semantic representation schemes.
- Conversation Processor – encodes and/or decodes conversation information; also known as a Component. A Conversation Processor may take as input the output of another Conversation Processor, and may generate Conversation Events and Conversation Streams.
- Conversation Session – a particular conversation consisting of two or more Conversation Streams (see below) generated by two or more agents through one or more Conversational Endpoints. Sessions may be persistent, but they often have a start point and an end point in time determined by one of the agents or by some other external event. (See the sketch after this list for how Sessions, Streams, and Events relate.)
- Conversation Stream – each Conversational Endpoint generates one or more Conversation Streams based upon the capabilities of the Endpoint and the preferences of the Agent. A Conversation Stream is associated with a particular Agent and may include any media type, including text, audio, video, and application UI events.
- Conversational Agent – a) for the purpose of this work, synonymous with conversational assistant (see below); b) a conversational assistant capable of independent action on behalf of the user.
- Conversational Assistant – a digital participant in a conversation. This may be an application with a consistent persona, such as Amazon Alexa, Google Assistant, the Target Google Assistant Action, a Facebook Messenger chatbot, or an IVR system at a bank. Conversational assistant is a common term in Dialogue System research and university-level instruction, where it is often also used to describe a human participant in a conversation. For clarity of reference, however, OVON will use the term “user” to identify a human participant (see “User” below).
- Conversational AI – the set of technologies that enable automated communication between computers and humans. This communication can be spoken or written. Conversational AI recognizes speech and text, understands intent, deciphers various languages, and responds in a way that mimics human conversation. It is sometimes also referred to as Natural Language Processing.
- Conversational Context – more research is required and underway; however, as of 2020.12.15, the Open Voice Interoperability Initiative uses this definition: information extracted from n prior utterances of the current conversation. This could include some or all of the following: information that has been input, output, or inferred in Conversational Processors, and the information state of the Dialog Manager.
- Conversational Delegation – the passing of dialog layers and control between one Conversational Assistant and another to fulfill a user intent. The first assistant in the delegation sequence is the initiating assistant; the second is the destination assistant.
- Conversational Endpoint – agents conduct conversations using conversational endpoints; these may be a phone, mobile device, voice speaker, personal computer, kiosk, or any other device that enables an agent to participate in a conversation. Endpoints may be referred to elsewhere as a “device” or a “channel.”
- Conversational Information Packets – information that relates to a specific period of time. Packets form the input and output of Conversation Processors.
- Conversational Mediation – the hosting of a dialog by a Conversational Assistant. In conversational mediation, the host assistant may fulfill a user intent by itself; it may access third-party data sources through API calls; or it may introduce the user to a third-party application that is resident on the platform of the host assistant. A mediating assistant cedes neither control nor access to the data within the conversation.
- Conversational Platform – a group of technologies that are used as a base for one or more conversational agents; also (see Platform below) a business model that harnesses and creates a large, scalable network of users and resources that can be accessed on demand.
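To make the relationships among these terms concrete, here is a minimal Python sketch of a Conversation Session holding Conversation Streams and Conversation Events. The field names and event types are illustrative assumptions, not part of any OVON specification.

```python
from dataclasses import dataclass, field
from typing import List

# Illustrative sketch only: field names and event types are assumptions,
# not an OVON specification.

@dataclass
class ConversationEvent:
    event_type: str   # e.g. "session_started", "processor_completed"
    source: str       # the component that generated the event
    timestamp: float  # seconds since the epoch

@dataclass
class ConversationStream:
    agent_id: str     # the agent this stream is associated with
    media_type: str   # "text", "audio", "video", or "ui_event"
    endpoint: str     # the Conversational Endpoint that generated it

@dataclass
class ConversationSession:
    session_id: str
    streams: List[ConversationStream] = field(default_factory=list)
    events: List[ConversationEvent] = field(default_factory=list)

# A session between a user and an assistant, each with one text stream.
session = ConversationSession(
    session_id="sess-001",
    streams=[
        ConversationStream("user-1", "text", "mobile-app"),
        ConversationStream("assistant-1", "text", "cloud-endpoint"),
    ],
    events=[ConversationEvent("session_started", "mobile-app", 1700000000.0)],
)
print(len(session.streams), "streams;", session.events[0].event_type)
```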
D
- Data – (per the Cambridge Dictionary): information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer. Context (see above) is a subset of the data accessed and used by the voice assistant system.
- Dialog Manager (DM) – handles the dynamic flow of the conversation, selecting and personalizing the response sent back to the user based on the action identified by the NLP (see the sketch after this list).
- Disambiguate – when the conversational platform hypothesizes two or more possible resolutions of a user utterance, it may ask the user for clarification or choose among the interpretations to determine the user’s intended meaning.
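As an illustration of the two entries above, the sketch below shows a hypothetical Dialog Manager turn: when the NLP step yields a single interpretation, the DM personalizes a response; when it yields several, the DM disambiguates by asking the user to choose. All intent names and responses are invented for this example.

```python
# Minimal sketch of a Dialog Manager turn, assuming the NLP step has already
# produced one or more candidate interpretations. All names are illustrative.

def handle_turn(interpretations, user_name):
    """Return the assistant's next response for one conversational turn."""
    if len(interpretations) > 1:
        # Disambiguate: ask the user to choose among competing hypotheses.
        options = " or ".join(i["intent"] for i in interpretations)
        return f"Did you mean {options}?"
    intent = interpretations[0]["intent"]
    # Personalize the response based on the action identified by the NLP.
    responses = {
        "check_order_status": f"Sure, {user_name}, your order ships tomorrow.",
        "order_product": f"Okay, {user_name}, what would you like to order?",
    }
    return responses.get(intent, "Sorry, I can't help with that yet.")

print(handle_turn([{"intent": "check_order_status"}], "Alex"))
print(handle_turn([{"intent": "order_product"},
                   {"intent": "check_order_status"}], "Alex"))
```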
E
- Entity – a part of the machine’s structured interpretation of an utterance. It is a custom data type that associates a concrete value with a word in a query (see the sketch after this list). Also known as an annotation.
- Explicit Invocation – an invocation type in which the user explicitly states a direct command to the channel to accomplish a specific task, communicating directly with a registered voice application.
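A minimal sketch of entity annotation, assuming a toy query: each entity associates a concrete, typed value with a word in the query. The entity types shown are invented for illustration.

```python
# Sketch of entity annotation: associating concrete typed values with words
# in a query. The query and entity types are invented for illustration.

query = "order two cartons of milk"

entities = [
    {"type": "quantity", "value": 2,      "span": "two"},
    {"type": "product",  "value": "milk", "span": "milk"},
]

for entity in entities:
    print(f"'{entity['span']}' -> {entity['type']} = {entity['value']!r}")
```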
F
G
H
I
- Implicit Invocation – an invocation type in which the user invokes the channel indirectly, using common phrasing rather than the explicit invocation of a specific voice application.
- Intent – a part of the machine’s structured interpretation of an utterance: the action the machine identifies from the user’s query. Also known as a classifier.
- Intent Broker (IB) – responsible for providing the fulfillable intents available for a resolved VRS record (e.g., where the resolved VRS record is “BigGrocery,” its fulfillable intents might be “order product,” “check order status,” and “add to shopping list”; see the sketch after this list). These fulfillable intents can be executed remotely on the DM or downloaded and executed locally on the device.
- Invocation – part of the construct of the user’s utterance during a conversation with a channel. An invocation names a specific function that the user wants and from which a particular response is expected.
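The sketch below illustrates the Intent Broker lookup using the “BigGrocery” example above. The registry contents and function shape are assumptions for illustration, not a defined OVON interface.

```python
# Sketch of an Intent Broker lookup, using the "BigGrocery" example above.
# The registry contents and function shape are assumptions for illustration.

FULFILLABLE_INTENTS = {
    "BigGrocery": ["order_product", "check_order_status", "add_to_shopping_list"],
}

def fulfillable_intents(vrs_record: str) -> list:
    """Return the intents that can be fulfilled for a resolved VRS record."""
    return FULFILLABLE_INTENTS.get(vrs_record, [])

print(fulfillable_intents("BigGrocery"))
# ['order_product', 'check_order_status', 'add_to_shopping_list']
```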
J
- Jobs to be Done – an approach to learning what will cause a customer to hire or bring your product or service into their life.
K
L
M
N
- Natural Language Processing (NLP) – a service and a branch of artificial intelligence that helps computers communicate with humans in their own language and scales other language-related tasks. NLP helps structure highly complex, unstructured human utterances and, conversely, render structured information as natural language. Natural Language Understanding (NLU) is a subset of NLP responsible for understanding the meaning of the user’s utterance and classifying it into the proper intents (see the sketch below).
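As a toy illustration of NLU, the sketch below structures an unstructured utterance into an intent plus entities. The keyword rules stand in for a real language model and are assumptions for this example.

```python
# Sketch of an NLU result: structuring an unstructured utterance into an
# intent plus entities. The rules below are toy assumptions, not a real model.

def understand(utterance: str) -> dict:
    """Classify a raw utterance into an intent with extracted entities."""
    text = utterance.lower()
    if "status" in text:
        return {"intent": "check_order_status", "entities": {}}
    if "order" in text:
        # Treat the last word as the product, for illustration only.
        return {"intent": "order_product",
                "entities": {"product": text.split()[-1]}}
    return {"intent": "unknown", "entities": {}}

print(understand("Order some milk"))
# {'intent': 'order_product', 'entities': {'product': 'milk'}}
```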
O
- Organization – a group of individuals brought together for a specific purpose, including the creation, transaction, and delivery of products or services. Examples would include a for-profit business, a not-for-profit group, or a government agency.
P
- Platform – the collection of components (the environment) needed to execute a voice application. Examples of platforms include the Amazon and Google platforms that execute voice applications.
Q
- Query – the user’s words requesting a specific function and expecting a particular response.
R
S
- Speech-to-Text (STT) – converts audio into text; see Automatic Speech Recognition above. STT may use customized models to overcome common speech recognition barriers, such as unique vocabularies, speaking styles, or background noise.
T
- Text-to-Speech (TTS) – converts text into audio; the inverse of Speech-to-Text. Also known as speech synthesis (see the pipeline sketch after this list).
- Technical Resource – a publisher or developer, acting either as a representative of an entity or as an independent party, whose role is to create the actual listing of a voice application.
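To show how STT, NLU, the Dialog Manager, and TTS fit together in a single voice assistant turn, here is a data-flow sketch in which every function is a hypothetical stub standing in for a real speech or language service.

```python
# End-to-end sketch of one voice assistant turn. Every function here is a
# hypothetical stub standing in for a real speech or language service.

def speech_to_text(audio: bytes) -> str:
    return "check my order status"        # stub: a real STT service goes here

def classify_intent(text: str) -> str:
    return "check_order_status"           # stub: NLU classifies the intent

def dialog_manager(intent: str) -> str:
    return "Your order ships tomorrow."   # stub: DM selects the response

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")           # stub: a real TTS voice goes here

reply_audio = text_to_speech(dialog_manager(classify_intent(speech_to_text(b""))))
print(reply_audio)
```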
U
- Utterance – a spoken or typed phrase.
- User – a person who interacts with channels.
V
- Voice Application – also known as a skill, action, capsule, or domain. The specific executable component associated with an invocation, a collection of related intents and entities, and the configuration of a dialog manager.
- Voice Application Interoperability – the ability of one voice application to invoke or work with another voice application.
- Voice Assistant System – a system where a user (primarily) uses his/her voice to interact with an automated conversational assistant for information or to control devices.
- Voice Registry System (VRS) – a global entity type in OVON and one of its most central components. It is a registry system with similarities to the Domain Name System (DNS), but for voice: VRS resolves requests to dialog management endpoints, NLP providers, and the dialog broker, and serves them consistently regardless of the NLP used (see the sketch after this list).
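By analogy with a DNS lookup, the sketch below resolves a voice application name to a hypothetical VRS record. The record fields and registry contents are invented for illustration and do not represent an OVON schema.

```python
# Sketch of a VRS lookup, by analogy with DNS: a registered name resolves to
# the endpoints needed to converse with it. The record fields are invented
# for illustration and are not an OVON schema.

VRS_REGISTRY = {
    "biggrocery": {
        "dialog_endpoint": "https://dialog.biggrocery.example/api",
        "nlp_provider": "example-nlp",
        "dialog_broker": "https://broker.example/biggrocery",
    },
}

def resolve(name: str) -> dict:
    """Resolve a voice application name to its dialog management record."""
    record = VRS_REGISTRY.get(name.lower())
    if record is None:
        raise KeyError(f"no VRS record for {name!r}")
    return record

print(resolve("BigGrocery")["dialog_endpoint"])
```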
W
- Wake Word – a specific word or phrase that catches the attention of the channel (see the sketch after this list).
- Web Content Model – a model in which the content is interwoven with the mechanism used to access it.
- Web Model – a model in which the content is separated from the mechanism used to access it.
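A minimal sketch of wake-word gating, assuming a hypothetical wake word: the channel ignores input until the wake word appears.

```python
# Minimal sketch of wake-word gating: the channel ignores input until its
# wake word appears. The wake word and utterances are illustrative only.

WAKE_WORD = "computer"

def attend(utterance: str) -> bool:
    """Return True if the utterance should wake the channel."""
    return utterance.lower().startswith(WAKE_WORD)

for heard in ["computer, play some music", "play some music"]:
    print(heard, "->", "attending" if attend(heard) else "ignoring")
```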