Building Conversational AI for Intelligent Meeting Scheduling

This is the first of a 4-part series on CalendarHero’s conversational AI. We’ll be deconstructing the way we use our award-winning technology to help CalendarHero customers schedule meetings quickly and intelligently.

By POURIA FEZWEE | SENIOR AI ADVISOR

The goal of conversational AI is to enable computers to interact with humans through text or voice, often by simulating a real conversation. The purpose of using conversational AI is to provide a service in an intuitive and efficient manner, where efficiency is determined by how easy the solution is to use and how quickly it fulfills a request.

To guarantee efficiency, a conversational UI should understand the user's request (natural language understanding) and guide them through a set of questions and answers (conversational flow) to clarify any parts of the request that may be ambiguous to the computer.

We will go through several aspects of the conversational UI technology that we have built and continuously improve at CalendarHero: an architectural overview, natural language understanding, conversational flow management, skill sets, and the say service. In today's post, we'll cover the architectural overview and the first part of our discussion on natural language understanding.

Below is a schematic representation of what we've built. Keep reading for a breakdown of each topic.

[Schematic: Zoom.ai Meeting Assistant architecture]

Architectural Overview: Four Pieces of the Puzzle

Natural Language Understanding

Our user's query is first processed by the natural language understanding unit, also known as NLU. That's where we determine the intent of their message and identify any entities they may have used in the body of their request. For example, when the user says "postpone my call with Cara to next week", the NLU unit determines that the intent of the request is to reschedule a meeting, that the meeting is with Cara (entity of person type), and that the desired new time for the meeting is sometime next week (entity of temporal type).
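To make this concrete, here is a minimal sketch of what an NLU result for that query might look like. The field names are illustrative rather than our production schema.

```python
# Illustrative sketch only: the shape of an NLU result for the query
# "postpone my call with Cara to next week". Field names are hypothetical.
nlu_result = {
    "intent": "reschedule_meeting",
    "entities": [
        {"type": "person",   "value": "Cara"},
        {"type": "temporal", "value": "next week"},
    ],
}
```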

Flow Management

This information is then passed to the conversational flow management unit, where the state of the conversation is tracked. That state comprises some of the previous intents in the context of the current conversation, as well as all the entities associated with those intents. In other words, this unit plays the role of a short-term memory that keeps track of what has been talked about so far. For our example, since there has been no previous request, no specific operation is performed, and the current intent and entities are passed on to the skill sets unit.
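A minimal sketch of this short-term memory, assuming a simple append-only state (the real flow manager does more, such as merging entities across turns):

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the "short-term memory" role described above:
# the flow manager accumulates intents and entities seen in the current
# conversation so later turns can refer back to them.
@dataclass
class ConversationState:
    intents: list = field(default_factory=list)   # previous intents, in order
    entities: list = field(default_factory=list)  # entities tied to those intents

    def update(self, intent, entities):
        self.intents.append(intent)
        self.entities.extend(entities)

state = ConversationState()
# First turn of the example: nothing to merge, so the state simply records
# the new intent and entities and passes them on to the skill sets unit.
state.update("reschedule_meeting", [("person", "Cara"), ("temporal", "next week")])
```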

Skill Sets

The skill sets unit is comprised of apps and integrations. An example of one of our major apps is the meeting app, which captures all the requirements for request types that have to do with managing the meeting ecosystem. Each app works closely with potentially several integrations; for example, the meeting app taps into the user's contacts as well as their calendar and email. The app relevant to a request is determined by the request's intent. To fulfill the request, the app calls the action corresponding to the intent and checks whether all the requirements for performing that request are met. The most basic requirements are the entity types. For our example, the minimum requirement for rescheduling a meeting is a reference to the meeting, whether it's the name of an attendee, the subject of the meeting, or the time. Depending on whether all the required pieces of information are available, the app chooses the most appropriate response and sends it to the say service.
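Here's an illustrative sketch of intent-based dispatch with a requirement check; the intent names and required entity sets are hypothetical:

```python
# Hypothetical sketch of intent-based dispatch and requirement checking.
REQUIRED_ENTITIES = {
    # Rescheduling needs at least one reference to the meeting:
    # an attendee, a subject, or a time.
    "reschedule_meeting": {"person", "subject", "temporal"},
}

def handle_request(intent, entities):
    required = REQUIRED_ENTITIES.get(intent, set())
    present = {etype for etype, _ in entities}
    if required & present:
        return f"Performing '{intent}' with {sorted(present)}."
    return "Which meeting do you mean? An attendee, subject, or time would help."

print(handle_request("reschedule_meeting",
                     [("person", "Cara"), ("temporal", "next week")]))
```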

Say Service

The say service takes the output of an app and, depending on the user's messaging platform, formats the response to fit the requirements of that platform. The output of this unit is what the user sees.
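A simplified sketch of platform-dependent formatting (the markup rules shown are examples, not the actual platform requirements):

```python
# Hypothetical sketch: rendering one logical response for different
# messaging platforms. The markup rules are simplified examples.
def render(response_text, platform):
    if platform == "slack":
        return f"*{response_text}*"      # Slack-style bold
    if platform == "msteams":
        return f"**{response_text}**"    # Markdown-style bold
    return response_text                 # plain-text fallback

print(render("Your call with Cara is moved to next week.", "slack"))
```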


Natural Language Understanding

Entity Recognition

Entities are pieces of information that give context to a query. In the query "postpone my call with Cara and Joe to next week," there are three entities: 1) call, which is the type of meeting, 2) Cara and Joe, the invitees to the meeting, and 3) next week, which is the proposed time for the meeting. Respectively, these are the meeting type, person, and temporal entity types. Each of these entities allows us to specify our request in a certain way.

The problem of recognizing the type and boundary of each entity is commonly known as named entity recognition (NER). Depending on the application, different entities may need to be extracted.

Entity Types

Different types of entities that our natural language processing engine recognizes include:

Temporal

These are the most common entity types and are used by several apps, such as meeting, search, CRM, and DocGen. Their main purpose is to limit the focus of a request to a certain period in time.

Three atomic types of temporal entities are:

  • Date and time, which point to a specific time interval or a point in time. For example, "2020", "April 2020", "next Monday", "Friday morning", and "2pm".

  • Duration, which specifies the length of a time interval. For example, "last two hours", "two weeks", and "1 day".

  • Frequency, which specifies repetition. For example, "every other day" and "every Monday".

Realistically, a temporal entity often combines more than one of the atomic types. For example, "last two weeks of January" has both duration and date types; "next Monday morning" has both date and time types; and "every Friday in March" has both date and frequency types.
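One illustrative way to represent such composite temporal entities (the class layout is hypothetical, not our production model):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch: a composite temporal entity built from the three
# atomic types described above.
@dataclass
class TemporalEntity:
    date_time: Optional[str] = None   # e.g. "next Monday", "2pm"
    duration: Optional[str] = None    # e.g. "two weeks"
    frequency: Optional[str] = None   # e.g. "every Friday"

# "last two weeks of January" -> duration + date
TemporalEntity(date_time="January", duration="two weeks")
# "every Friday in March" -> date + frequency
TemporalEntity(date_time="March", frequency="every Friday")
```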

Person

This is a person's name and may consist of the first and/or last names of one or more people. Person entities are used by apps such as meeting and CRM. For example, "Cara and Joe" is an instance of this entity in the query "reschedule the meeting with Cara and Joe to next week".

Team

This entity is meant to facilitate the inclusion of several people in one request and is mainly used by the meeting app. For example, "cancel the meeting with the marketing team".

Job title

This is mainly used by the search app to find contacts with specific job titles, such as "Sales", "Engineering", and "VP".

Document

This is mainly used by the DocGen app, and captures the name or type of a document. For example, "generate an NDA document for Jon Smith".

Room

This is a sub-type of the location entity type and specifies a room, as used by the meeting app for room booking. For example, "the boardroom" can be used to specify the room in which a meeting is to take place.

Email address

This is an alternative to the person type and is used interchangeably with it. For example, we recognize queries such as "schedule a coffee with [email protected]".

Organization

This is the name of an organization and is mainly used by the CRM app. For example, in "show me the deals from MIT Press", we recognize that we're solely interested in deals with a certain company: MIT Press.

Subject

This is mainly used to set the subject of a meeting or to search for meetings by their subject. Subject entities are among the less structured types and may take many different forms. For example, "about CalendarHero's natural language understanding engine" is a valid way of specifying the subject of a meeting.

Optional

This is an entity that specifies how essential the presence of other entities is. For example, together with the person type, it can be used to show that the presence of a specific person may not be obligatory.

Location

These are entities that are commonly used in the context of the meeting app and are mainly meant to specify the address of a meeting. For example, "325 Front St W, 4th floor" is recognized as the street address, and can be used to set the location of a meeting.

Phone number

This could also be viewed as a sub-type of the location type and is most commonly used by the meeting app. For example, in "call with Alexey at (555) 555-5555", we recognize the phone number for the call.


Expandability

We have built the entity recognition piece of our technology in a way that is easily expandable to other entity types. This is done through the definition of custom entities.
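As an illustration, custom entities could be registered declaratively; the registry below is a hypothetical sketch, not our actual API:

```python
import re

# Hypothetical sketch of how a custom entity type might be registered
# declaratively, without touching the core engine.
CUSTOM_ENTITIES = {}

def register_entity(name, pattern):
    CUSTOM_ENTITIES[name] = re.compile(pattern)

def extract_custom(text):
    return [(name, m.group()) for name, rx in CUSTOM_ENTITIES.items()
            for m in rx.finditer(text)]

# A team could add, say, a "ticket" entity for issue-tracker references:
register_entity("ticket", r"\b[A-Z]{2,5}-\d+\b")
print(extract_custom("schedule a review of CAL-1234 with Joe"))
# [('ticket', 'CAL-1234')]
```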

Recognition of Entities

In order to recognize different entity types, our natural language understanding engine uses two different types of algorithms: text-based and graph-based. Whereas text-based algorithms depend solely on text, graph-based ones make use of the linguistic structure of requests.

Text-based

This type of algorithm depends on static patterns of text for recognizing entity types and is used mainly for entity types such as email and street addresses. Text-based recognition is also known as regex-based.
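For example, a simplified regex-based recognizer for emails and phone numbers might look like this (production patterns are considerably more involved, and the sample address is made up):

```python
import re

# Sketch of text-based (regex) recognition for highly structured entity
# types. These patterns are simplified for illustration.
EMAIL_RX = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")
PHONE_RX = re.compile(r"\(\d{3}\)\s*\d{3}-\d{4}")

text = "schedule a coffee with [email protected] or call (555) 555-5555"
print(EMAIL_RX.findall(text))  # ['[email protected]']
print(PHONE_RX.findall(text))  # ['(555) 555-5555']
```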

The main shortcomings of these algorithms are that they lack flexibility and disregard linguistic clues. The lack of flexibility usually shows when the order of words in a noun phrase changes, given that word order is the main distinguishing factor for text-based methods. This is the main reason for making use of linguistic structures: they formalize the relationship between different parts of a sentence, so that arbitrary ordering of words won't play a role.

Graph-based

These algorithms are called graph-based because they operate on the dependency graph. The dependency graph is the result of dependency parsing, which hierarchically breaks a sentence down into its sub-parts according to their relative relevance and importance in the sentence. For example, for the sentence "postpone my meeting with Roy to next week," the graph would be as follows:

[Figure: dependency graph for "postpone my meeting with Roy to next week"]

In this graph, the root is the word "Postpone", with "meeting" and "to" as its dependents. From there, "my" and "with" depend on "meeting", "Roy" on "with", "week" on "to", and "next" on "week". This is a great source of information, because it tells us which words to focus on at each level of analysis. We start with "Postpone", which is the root of the graph and the most important word in the sentence; each child of the root complements its meaning in a certain way. We will talk more about the dependency graph when we address the problem of intent recognition.
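As an illustration, a library such as spaCy can produce this graph; the sketch below is not necessarily the parser we use in production:

```python
import spacy

# Requires the small English model:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("postpone my meeting with Roy to next week")

for token in doc:
    # Each token points at its head, which reconstructs the graph:
    # e.g. meeting -> postpone (direct object), Roy -> with, week -> to
    print(f"{token.text:<10} {token.dep_:<8} head={token.head.text}")
```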

This graph is built using machine learning and draws on two types of information: syntactic and semantic. We use both sources of information to determine whether a word is part of a specific entity.

Syntax

We use the relative structure of each word with respect to other words as a feature. For example, "meeting" depends on "Postpone", and their relationship is of direct object type, which means that the action of postponing is applied directly to the meeting. The relationship between "Postpone" and "to", on the other hand, is of preposition type, which means that the children of "to" modify the place or the time of the postponing act.

Semantics

Furthermore, we use the semantics of each word as a secondary type of feature, along with its relative relationship with other words. This allows us to define entities in certain contexts, or in other words, to disambiguate different entity types. Word embeddings are good proxies for semantic information, although they carry more than just semantic information.
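Putting the two together, an illustrative feature extractor might pair each token's dependency relation with its word vector; the feature layout below is hypothetical, not our production pipeline:

```python
import spacy

# Illustrative sketch: for each token, combine a syntactic feature (its
# dependency relation to its head) with a semantic feature (its word
# vector). A downstream classifier would consume records like these to
# decide whether the token belongs to a given entity.
nlp = spacy.load("en_core_web_md")  # the md model ships with word vectors

doc = nlp("postpone my meeting with Roy to next week")
features = [
    {
        "token": token.text,
        "dep": token.dep_,              # syntactic: relation to head
        "head": token.head.text,
        "embedding": token.vector[:3],  # semantic: first dims, for display
    }
    for token in doc
]
print(features[2])  # features for "meeting"
```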

This is the first in our series sharing how we’ve built our conversational AI. Thanks for reading!

 
 