This paper presents an analysis of the Daimler-Benz dialogue manager (D-B DM). The analysis is presented in the form of (a) a 'grid' which describes the system's properties with particular emphasis on dialogue management, (b) a life-cycle model which provides a structured description of the system's development and evaluation process, and (c) supporting material, such as component architecture, dialogues and references.
The presented information will be cross-checked with the developers of D-B DM as well as with the complementary descriptions of other aspects of the D-B system provided by the DISC partners. These other descriptions address language understanding and generation, done by IMS.
Demonstrator: Available in Ulm
Contact: Paul Heisterkamp
The Daimler-Benz dialogue manager is a stand-alone module. It originates from the Sundial project, whose task domains were train timetable information and flight information. The dialogue manager has since been used in four other (research as well as commercial) applications with continuous speech input: Access, which automates call centres for personnel-intensive applications in insurance; STORM, an application that provides digital road map updates with dynamic information on closed streets, road works, traffic jams, etc., used for regional traffic management in the Stuttgart area; and two internal projects whose contents are not publicly available.
|Influencing users||Yes, if uninfluenced interaction does not work|
|Transaction success||80%+ (overall application)|
|General evaluation||No ISO standards or other general methods used.|
|Nature||Continuous, spontaneous speech|
|Vocabulary||5000; good performance for up to 10000 words but not used in any application yet|
|Word hypotheses||Yes, Verbmobil word hypothesis graph|
|Grammar||Statistical language model or finite-state model over categories|
|Input||Files (for pre-recorded speech), text strings for TTS|
|Lexicon||No relation with parser or recogniser lexicon|
|Sound generation technique||Coded or parametric, e.g. formant synthesiser; PSOLA from Verbmobil|
|Prosody||Yes (for coded as well as for parametric)|
|Pronunciation description units||-|
|Lexicon||Up to 5000 full forms are used in applications|
Lexical grammar approach based on the theory of Unification Categorial Grammar. The number of grammar rules is restricted to a few basic rules of combination. Lexical entries are represented as complex feature structures which are combined by simple unification. Feature structures (or signs) consist of morphological, syntactic and semantic attribute fields. There are two types of sign: basic signs without arguments and functor signs with a list of arguments. Semantic representation is constructed simultaneously with the syntactic derivation by stating co-references between syntactic and semantic attributes in the functor sign.
Also other grammar types can be used: LAUG (left association unification grammar) and PSG (phrase structure grammar).
Word graph parser. Word graphs are directed acyclic graphs. Each edge is labelled with a scored hypothesis and each node with a point in time. There are no gaps or overlaps in word graphs. A word graph has a single start node and a single end node. Each path from the start node to the end node forms a possible sentence hypothesis. Scores are positive numbers which assign a (pseudo-)probability measure to a word hypothesis; 0 is the best score. The A* algorithm is used for search and integrated in an agenda-driven chart parser whose initial edges are built from word hypotheses. The agenda is initialised with seed entries ordered with respect to a heuristic score assignment. Seed definitions are provided by top-down predictions from the dialogue management component. The initial ordering on the agenda, which combines heuristics with contextual knowledge, implements an island parsing strategy. The parser terminates with the first complete hypothesis. If no complete hypothesis can be found, a robust parser delivers partial solutions.
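The search over such a word graph can be illustrated with a minimal Python sketch. The graph encoding, node names and words below are invented for illustration, and a plain Dijkstra-style best-first search stands in for the A* agenda-driven chart parser described above:

```python
import heapq

def best_hypothesis(edges, start, end):
    """Find the lowest-cost word sequence through a word graph.

    edges: dict mapping node -> list of (next_node, word, score),
    where scores are positive and lower is better (0 is best), as in
    the parser described above.
    """
    # Priority queue of (accumulated score, node, words so far).
    queue = [(0.0, start, [])]
    best = {}
    while queue:
        cost, node, words = heapq.heappop(queue)
        if node == end:
            return cost, words
        if node in best and best[node] <= cost:
            continue
        best[node] = cost
        for nxt, word, score in edges.get(node, []):
            heapq.heappush(queue, (cost + score, nxt, words + [word]))
    return None  # no complete hypothesis found

# Invented example graph: nodes are time points, edges scored words.
graph = {
    0: [(1, "ich", 0.2), (1, "nicht", 0.6)],
    1: [(2, "will", 0.1)],
    2: [(3, "nach", 0.0)],
    3: [(4, "Hamburg", 0.3), (4, "Hamm", 0.5)],
}
best_hypothesis(graph, 0, 4)
```

The first complete path popped from the queue is guaranteed to be the best-scored sentence hypothesis, mirroring the parser's termination with the first complete hypothesis.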
The resulting syntactic and semantic descriptions are passed to the dialogue level. Passing of multiple results between the linguistic level and the dialogue level is allowed.
|Semantics||SIL (Semantic Interface Language) is used for representing the surface semantics of an utterance as one or more independent structures.|
|Discourse, context||Context is used in terms of predictions.|
|Generation||Separate module. Simple filling of sentence template.|
|None (unimodal system)||Yes.|
|Focus, prior||Expectations (see below) as well as knowledge of the surface structure of the previous system utterance are taken into account to reduce the search space. They are used to focus search and decide where to start.|
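The Generation row above states that output is produced by simple filling of sentence templates. A minimal Python sketch of this idea follows; the template names and slot names are invented for illustration:

```python
# Hypothetical template inventory; the report only states that
# generation fills simple sentence templates.
TEMPLATES = {
    "confirm_route": "You want to travel from {origin} to {destination}.",
    "ask_time": "At what time do you want to leave?",
}

def generate(name, **slots):
    """Fill the named sentence template with the given slot values."""
    return TEMPLATES[name].format(**slots)

generate("confirm_route", origin="Essen", destination="Hamburg")
# -> "You want to travel from Essen to Hamburg."
```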
The contextual interpretation (belief module) anchors the utterance parts received from the linguistic analysis in their order in the surface semantic representation of the discourse universe. Anchoring means instantiation of possible discourse objects through the surface SIL expression. A part is anchored into the semantic context spanned by the preceding question or spanned by the part(s) already anchored. The belief module also interacts with the task module on sub-task identification.
Linguistic markers, such as topic shift markers, are not used.
Predictions are used in relation to the recogniser. A set of dialogue goals is mapped onto a dialogue state descriptor, from which a language model is chosen.
Semantic predictions are used in the interaction between contextual interpretation and linguistic analysis. A semantic prediction is a partially instantiated structure of surface semantic descriptions or a list thereof. It can specify the expected contents of an utterance to be one of a set of lexical entries (downward prediction: e.g. yes or no or quasi-synonyms of these). Or it can specify the expected contents to be of a certain class of lexical entries which have the same semantic description or a certain class of semantic structures which can fill the same roles in larger structures (upward prediction: e.g. city + preposition 'to'). The parser uses the predictions to start the search of the word graph at nodes fulfilling predictions. Moreover it can check partial results against the predictions to enhance the speed of parsing and to guide the parsing in that direction.
Semantic predictions are presented in SIL and handled by the linguistic interface. Predictions for the recogniser are just language model numbers.
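The behaviour of semantic predictions can be sketched as a partial-structure match: a prediction fixes some features and leaves others open, and a parse fragment fulfils it if every fixed feature is compatible. The feature names and values below are invented for illustration; real predictions are SIL structures:

```python
def fulfils(prediction, fragment):
    """Return True if the fragment is compatible with the partially
    instantiated prediction: every feature the prediction fixes must
    match; features set to None must exist but are unconstrained."""
    for key, value in prediction.items():
        if value is None:
            # Feature must be present, but any value is acceptable.
            if key not in fragment:
                return False
        elif isinstance(value, dict):
            if not isinstance(fragment.get(key), dict) or not fulfils(value, fragment[key]):
                return False
        elif fragment.get(key) != value:
            return False
    return True

# Downward prediction: expect one specific lexical entry ("yes").
downward = {"type": "answer", "lemma": "yes"}
# Upward prediction: expect any city filling the destination role.
upward = {"type": "city", "role": "destination", "value": None}

fulfils(upward, {"type": "city", "role": "destination", "value": "Hamburg"})
```

The parser can use such a check both to pick word-graph start nodes that fulfil a prediction and to prune partial results against it.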
|Task(s)||The dialogue manager is fairly task-independent. In the German Sundial system in which it was first used, the task was train timetable information. However, the dialogue manager is being used in several (research as well as commercial) applications with continuous speech input. These include information providing applications like train time table and flight enquiry systems, and information seeking applications like road map update for long term and short term modifications used for regional traffic management in the Stuttgart area, and telephone-based applications for direct insurance and call management.|
The tasks for which the system has been used have been well-structured tasks requiring a well-defined set of pieces of information (i.e. slots to be filled) in order to provide the requested information.
In Access, which is the largest application so far incorporating the dialogue manager, about 25 pieces of basic information can be asked for/provided by the system. Access has an active vocabulary of about 5000 words. Because the domain requires that e.g. all car types are included in the system's vocabulary, it has been necessary to chunk the vocabulary, since the full vocabulary far exceeded the 5000 words available at the time.
The task structure in the existing applications has a depth of 2-3 levels. In Access, e.g., there are two levels and in STORM there are three.
The domain communication is mixed initiative. The system will ask questions but the user may provide more information than asked for. In fact the user may provide all the parameters needed by the system in one utterance if s/he likes.
The dialogue strategy is one of an ordered set of possibilities that serve to determine the optimal continuation of the dialogue. The set is ordered in the sense that each setting corresponds to the minimal number of dialogue turns needed to solve the user's request, given that the user mentions all the relevant parameters in the first utterance. The choice of the current strategy follows a degradation and recovery meta-strategy. The maximum strategy threshold can be set by the system developer, cf. Figure 4. The strategy is dynamic in that the system adjusts its setting according to whether it encounters a contradiction from the user. If the user contradicts the system, it is assumed that something in its understanding went wrong. The system then tries to restrict the understanding task by asking more specifically, finally degrading to a leading-initiative mode where the user is asked not only to provide an isolated task parameter but also to give it in a specific form that makes e.g. the recognition more reliable, because a specifically restricted language model can be used. An example of the degradation strategy in use is shown in Figure 3.
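The degradation and recovery meta-strategy can be sketched as a small state machine. The existence of an ordered set of levels and the developer-set threshold are taken from the report; the step-by-one adjustment below is an assumption for illustration:

```python
class DialogueStrategy:
    """Sketch of the degradation/recovery meta-strategy (assumed
    step-by-one transitions; the real system's transitions may differ)."""

    def __init__(self, threshold=5):
        self.level = 1              # 1 = most open, mixed-initiative setting
        self.threshold = threshold  # developer-set maximum (cf. Figure 4)

    def on_contradiction(self):
        # Degrade: ask more specifically, ultimately down to spelling mode.
        self.level = min(self.level + 1, self.threshold)

    def on_understood(self):
        # Recover: relax back towards the open strategy.
        self.level = max(self.level - 1, 1)

strategy = DialogueStrategy(threshold=5)
strategy.on_contradiction()
strategy.on_contradiction()
strategy.level  # 3
```

The threshold guarantees the system never degrades past the developer's chosen most restrictive setting.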
The system can do general contextual inferences as well as task-dependent inferences. General inferences include e.g. time inferences and general contradiction detection. Task-dependent inferences are tied to task objects. For example, in the Access system the dialogue manager may ask the application about the power of a motor and get back a figure. This figure may mean horsepower or it may mean kilowatts. Which interpretation to use is decided by the belief module depending on the current context.
Neither speech acts nor indirect speech acts are identified.
|Interaction level||The system's dialogue strategy has five levels, cf. Figure 4. Graceful degradation is used in case of misunderstandings.|
|Implementation of dialogue management|
The approach taken regards dialogue as a joint activity. It is based on a layered set of units that, taken together, model the dialogue as a combination of belief and intention states of the system.
The dialogue manager first tries to incorporate the surface semantic description of the user utterance into the contextual model. This may result in one of five types of change in the contextual model. Any sub-part of the utterance may either be new for the system, repeated by the user, inferred by the system, modified by the user or negated by the user. These changes lead to an appropriate change in the goal states of the pragmatic component (or dialogue module) as do changes coming from the application system interface (or task module) such as a request for a new parameter or the delivery of a solution to the caller's request. From these goal states the pragmatic component determines its next utterance trying to reach as many goals as possible without overloading the caller with information or different requests. No more than 3-4 items in an utterance are accepted.
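The five types of change to the contextual model can be sketched as a classification over slot values. The slot names and calling convention below are invented for illustration; note that 'inferred' changes are produced by the system itself rather than by a user utterance:

```python
def classify_change(slot, value, context, negated=False):
    """Classify how an utterance part changes the contextual model:
    'new', 'repeated', 'modified' or 'negated'. ('Inferred' changes
    originate from the system, not from the user utterance.)"""
    old = context.get(slot)
    if negated:
        return "negated"
    if old is None:
        return "new"
    if old == value:
        return "repeated"
    return "modified"

context = {"destination": "Hamburg"}
classify_change("origin", "Essen", context)          # "new"
classify_change("destination", "Hamburg", context)   # "repeated"
classify_change("destination", "Hannover", context)  # "modified"
```

Each classification then drives the corresponding change in the goal states of the pragmatic component.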
Dialogue goals may belong to one of three classes: initiative, reaction and evaluation. A request goal, e.g., is an initiative, a confirm goal is an evaluation, etc. The dialogue strategy determines how initiatives, reactions and evaluations are ranked and combined. In general, evaluations that are not initiatives are ranked higher than reactions, and reactions have precedence over initiatives. This ensures that answers to questions are realised earlier than system questions. The standard setting is that any number of reactions (except confirm goals for inferred items) and one initiative may be realised at the same time, making the dialogue fast. If something goes wrong a meta-strategy of degradation and recovery may move this setting, cf. Figure 4.
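The standard setting described above — evaluations before reactions before initiatives, any number of reactions plus one initiative, and a cap of 3-4 items per utterance — can be sketched as follows. The goal representation is invented for illustration:

```python
# Lower value = realised earlier, per the ranking described above.
PRECEDENCE = {"evaluation": 0, "reaction": 1, "initiative": 2}

def plan_turn(goals, max_items=4):
    """Select the goals to realise in the next system utterance:
    evaluations first, then reactions, then at most one initiative,
    capped at max_items (the report's standard setting)."""
    ordered = sorted(goals, key=lambda g: PRECEDENCE[g["class"]])
    selected, have_initiative = [], False
    for goal in ordered:
        if len(selected) >= max_items:
            break
        if goal["class"] == "initiative":
            if have_initiative:
                continue  # only one initiative per turn
            have_initiative = True
        selected.append(goal)
    return selected

goals = [
    {"class": "initiative", "act": "request_time"},
    {"class": "reaction", "act": "answer_connection"},
    {"class": "evaluation", "act": "confirm_destination"},
    {"class": "initiative", "act": "request_date"},
]
[g["act"] for g in plan_turn(goals)]
# ["confirm_destination", "answer_connection", "request_time"]
```

A degraded strategy level could then tighten this setting, e.g. by lowering max_items or forcing an explicit confirmation before any initiative.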
|Speech acts||The system does not identify speech (or dialogue) acts in the users' input.|
|Discourse particles||The system does not identify discourse particles in the users' input.|
|Co-reference||Intra-sentential co-references are resolved by the parser. Other kinds of co-reference resolution are not done but a mechanism for doing it is already incorporated. It is based on interaction between the belief module and the linguistic history maintained by the linguistic interface. For example the utterance "Does it stop in Ulm" has "vehicle" attached to "it" as surface semantics. Then a backward search in time is made until the first possibility which satisfies the conditions is met.|
|Ellipses||The system copes with ellipses via robust parsing allowing non-complete sentences.|
|Segmentation||The parser identifies sentences and phrases and represents an utterance as one or more independent structures (parts). This is the input to the dialogue management component.|
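The co-reference mechanism described in the Co-reference row above — a backward search in time for the first discourse object satisfying the conditions — can be sketched as follows. The history entries are invented for illustration:

```python
def resolve(required_type, history):
    """Search the linguistic history backwards in time and return the
    first recorded object whose type satisfies the condition."""
    for entry in reversed(history):
        if entry["type"] == required_type:
            return entry
    return None  # no suitable antecedent found

# "Does it stop in Ulm?" attaches "vehicle" to "it" as surface
# semantics; the backward search finds the most recent vehicle.
history = [
    {"type": "city", "value": "Essen"},
    {"type": "vehicle", "value": "the 16.18 train"},
    {"type": "city", "value": "Hamburg"},
]
resolve("vehicle", history)["value"]  # "the 16.18 train"
```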
The linguistic interface module is responsible for maintaining a linguistic model of system and user utterances. Each SIL structure has an ID attached which is maintained by the linguistic history. Semantics are recorded, and it is possible to query which surface realisations exist. Thus the linguistic history is mainly a surface structure representation. System utterances are used to trigger predictions.
The record may be used for co-reference resolution, cf. above.
|Topic||Semantic objects in the belief module have a kind of time stamp relative to each other. Semantic objects are represented in SIL. This information is not used at the moment but could be used in case a user mentions the same value for a slot more than once, e.g. Hamburg as the departure station, since this indicates that there may be some misunderstanding in the dialogue.|
|Task||The task module contains (represented in task SIL) all the most recent versions of data provided during the dialogue. This means that the dialogue can be finished at any time transferring all data to either the application system (if this was not done during the dialogue) or to the terminal masks of the human operator.|
|Performance||The system does not really maintain a record of the user's performance during interaction. However, if it detects a contradiction from the user, the system assumes that it has misunderstood something and tries to facilitate understanding by asking more specifically (degrading to a lower level). The dialogue manager has five levels of interaction, the lowest one being spelling mode. This has so far been sufficient. The strategy threshold can be set, and the system never goes above this threshold. The maximum number of allowed failures in a turn can also be set; usually three retries have been permitted. If the system still cannot understand the user, it will redirect to an operator if possible, or otherwise tell the user that it will stop the dialogue due to lack of understanding and ask for a re-try.|
|Data||The domain data used by the system during interaction depend on the precise application. In the German Sundial system, e.g., the data were train stations, connections, dates, departure and arrival times. The data have in all applications been fully realistic, e.g. a full train timetable. The representation of data varies, depending on the application. In the task module there will be a translation from Prolog to whatever the application needs when data have to be retrieved.|
|Rules||The basic idea is that the available information should be exploited to the extent possible. For instance, the train timetable database has a marker as to whether a train runs on a daily basis. If this is the case the system need not negotiate a date with the user. Such inferences are carried out in interaction between the task module and the belief module. The task module also knows about system defaults and inferences. For inferences, see also communication above.|
|Goals||From the system's point of view, the user's goal during interaction is to carry out a task which the system knows about, such as to get train timetable information or to get an insurance offer.|
|Beliefs||Through feedback the system tries to make clear to the user what it has understood. It is then up to the user to contradict if his/her beliefs about the dialogue differ.|
|Preferences||In the current systems there are no user preferences handled. However, this facility could easily be added, but it needs caller identification.|
|User group||There are no distinctions among user groups, such as between domain novices and experts, novices and experts in using the system. However, it would be a possibility to add this using the level strategy.|
|Cognition||This may be considered part of graceful degradation.|
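The rule cited in the Rules row above — a train marked as running daily means the date need not be negotiated — can be sketched as an inference over the open task parameters. The slot and marker names below are invented for illustration:

```python
def open_parameters(slots, db_entry):
    """Return the task parameters still to be negotiated, dropping
    the date when the database marks the train as running daily
    (sketch of the task/belief-module interaction)."""
    missing = [name for name, value in slots.items() if value is None]
    if db_entry.get("runs_daily") and "date" in missing:
        missing.remove("date")
    return missing

slots = {"origin": "Essen", "destination": "Hamburg",
         "date": None, "time": None}
open_parameters(slots, {"runs_daily": True})  # ["time"]
```

With the inference applied, the system would ask only for a departure time; without it, both date and time remain open.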
|Component architecture and function|
The component has a domain-independent, generic architecture (easy to adapt to different task domains). It is language independent since it uses its own language (SIL) for input representation.
The component has been implemented in Quintus Prolog and runs on a variety of different platforms (Unix, NT, Windows95).
The syntactic and semantic descriptions resulting from the parsing process are passed to the dialogue level. Passing of multiple results between the linguistic level and the dialogue level is allowed. SIL (Semantic Interface Language) is used for interfacing between the linguistic analysis and the contextual interpretation via the linguistic interface module. SIL is also used for interfacing between the message planning module and the generation module.
The dialogue manager also communicates predictions to the recognition module and to the linguistic analysis module.
The dialogue manager consists of five modules:
The linguistic interface module interfaces the dialogue manager with the parser and is responsible for maintaining a linguistic model of system and user utterances.
The dialogue module is responsible for maintaining a model of the dialogue context, building an interpretation of user utterances and determining how the dialogue should continue. It receives as input from the belief module changes in the knowledge state. It then finds the best local continuation in terms of a system utterance, based on the actual dialogue situation.
To determine the appropriate contextual interpretation of user utterances, the dialogue module interacts with the belief module which maintains a model of belief containing not only concepts created directly as a result of user utterances, but also inferential extensions. For example, if the system initiated an exchange to determine the departure date of a flight, this exchange can be closed if the belief model can interpret the user's utterance as referring to a date concept. The belief module requires context information from the dialogue module in order to guide the interpretation process. The belief module receives input from the linguistic analysis via the linguistic interface module. It anchors the received input into the semantic context. This process results in a change of the knowledge state and the instantiations of surface expressions pertaining to the application are mapped onto task objects by means of transfer rules.
The belief module also co-operates with the message planning module in order to provide semantic descriptions of concepts referenced in the plan of system utterances.
The task module is responsible for maintaining a model of the task structure of the dialogue, consulting domain-specific application databases and informing the dialogue module of the current state of the task. Typically, this involves deciding whether sufficient task information has been provided by the user and, if not, which additional parameters are required. In case of sufficient but incorrect information provided by the user, the system tries to relax the task parameters and propose alternatives. The task module receives from the belief module changes in the state of the task objects.
|Flow||See the flow among the sub-components in Figure 1.|
|Function||The dialogue manager is the control module of an overall system.|
|Tools and methods||-|
|System resource utilisation||-|
|Shared information resources||-|
Figure 1. The architecture of the D-B dialogue manager.
System1: This is the Automatic Intercity Inquiry System. Which information do you want?
Caller2: I need a connection from Essen to Hamburg tomorrow.
System3: You want to travel from Essen to Hamburg. You want to travel on Wednesday, September 22nd. At what time do you want to leave?
Caller4: At 16.00 hours.
System5: You want to leave around 16.00 hours?
System7: I have found two connections for you. The first train leaves Essen at 15.45, 15.45, arriving in Hamburg main station 19.32. The second train leaves Essen at 16.18, 16.18, arriving in Hamburg main station 19.43. Do you have another inquiry?
Caller8: No thanks [Hangup].
Figure 2. Combined confirmation strategy in the German Sundial system [Heisterkamp 1993].
System1: This is the Automatic Intercity Inquiry System. Which information do you want?
Caller2: I want to go from Essen to Hamburg.
System3: You want to travel from Essen to Hamm? On what day do you want to travel?
Caller4: To Hamburg.
System5: You want to travel to Hannover?
Caller6: No, Hamburg.
System7: You want to travel to Hamm?
System9: Please spell the name of the arrival city.
System11: You want to travel to Hamburg?
System13: On what day do you want to travel?
Caller14: On Wednesday.
System15: You want to travel on Wednesday, September 22nd. At what time do you want to leave?
Caller16: Around four o'clock.
System17: You want to leave around 16.00 hours?
System19: I have found two connections for you. (...) Do you have another inquiry?
Caller20: No thanks [Hangup].
Figure 3. Degradation and recovery in the German Sundial system [Heisterkamp 1993].
Figure 4. Dynamic Dialogue Strategy [Heisterkamp 1993].
System/component screen shot(s)