Overall design goal(s): What is the general purpose(s) of the design process?
The aim is to build a generic, i.e. partially re-usable, multimodal spoken dialogue system in which speech synthesis and speech recognition can be studied in a human-machine dialogue framework. The system concept is quite advanced for its time and represents an eminently worthwhile goal. The same is true of the objective of achieving re-usability. The emphasis on speech synthesis and speech recognition marks the project as one of exploratory research. There was no particular focus on dialogue management. The effect seems to be that there are fewer detailed analyses and exact measurements of results than one would have expected had this component been a focal point.
Hardware constraints: Were there any a priori constraints on the hardware to be used in the design process?
A powerful (for that time) computer was needed to be able to do real-time processing. Even then, the vocabulary was limited to about 1000 words. Recognition quality also limited the vocabulary size. These conditions appear reasonable for the time.
Software constraints: Were there any a priori constraints on the software to be used in the design process?
An existing synthesiser and an existing recogniser were used, neither of them off-the-shelf products. Rather, they were software components under continuous development. During the project the effort was focused on the recogniser. Was effort on speech synthesis de-emphasised in the project, contrary to the original goal? Why?
Customer constraints: Which constraints does the customer (if any) impose on the system/component? Note that customer constraints may overlap with some of the other constraints. In that case, they should only be inserted once, i.e. under one type of constraint.
No hypothetical customer constraints were introduced for this research system. ** This point deserves criticism, even in an exploratory research project. The basic advantage of assuming hypothetical customers is that the developers force themselves to face realistic problems and hence to be accountable for any deviations from a realistic development life-cycle. Such deviations may be justifiable from many different points of view but they are not likely to be recognised as such unless the project has (simulated) realistic boundary conditions.
Other constraints: Were there any other constraints on the design process?
The main limiting factors were personpower and knowledge. More people should preferably have been involved, but it was hard to find good people with the right speech technology background. One major constraint was that this was the first time the developers were involved in such a project. They made some mistakes that slowed them down, the kinds of mistakes people make when a domain is new and the work is to some extent based on trial and error. ** The Waxholm development team could have benefitted from an explicit best practice model for spoken dialogue systems development and evaluation.
Design ideas: Did the designers have any particular design ideas which they would try to realise in the design process?
The dialogue should be described by a grammar, and the dialogue manager should be probabilistic: central to dialogue management is a matrix of topics and features expressing the probability of which topic the user is talking about, cf. Figure 7 in [Carlson and Hunnicutt 1995]. The probabilistic dialogue manager based on keywords in the users' input is an interesting, innovative feature of Waxholm.
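The keyword-driven, probabilistic topic selection described above can be sketched as follows. This is an illustrative reconstruction, not the actual Waxholm matrix: the topics, features, priors and probability values below are invented for the example, and the real system used more topics and probabilities trained on collected dialogues.

```python
"""Sketch of a probabilistic, keyword-driven topic predictor in the spirit
of the Waxholm dialogue manager (cf. Figure 7 in Carlson and Hunnicutt
1995). All numbers and the topic/feature inventory are illustrative."""

# Rows: topics; columns: lexical features (keywords) seen in user input.
# Each cell holds an assumed P(feature | topic).
TOPIC_FEATURE_PROB = {
    "TIME_TABLE": {"boat": 0.30, "depart": 0.40, "hotel": 0.01},
    "EXIST":      {"boat": 0.05, "depart": 0.02, "hotel": 0.50},
}
TOPIC_PRIOR = {"TIME_TABLE": 0.6, "EXIST": 0.4}

def predict_topic(utterance_words):
    """Return the topic maximising P(topic) * prod P(feature | topic)."""
    scores = {}
    for topic, prior in TOPIC_PRIOR.items():
        score = prior
        for word in utterance_words:
            # Words not in the matrix get a small floor probability.
            score *= TOPIC_FEATURE_PROB[topic].get(word, 0.001)
        scores[topic] = score
    return max(scores, key=scores.get)

print(predict_topic(["when", "does", "the", "boat", "depart"]))  # TIME_TABLE
```

The point of the matrix formulation is that topic choice degrades gracefully: even with recognition errors, a few surviving keywords can still pull the probability mass toward the right topic.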
Designer preferences: Did the designers impose any constraints on the design which were not dictated from elsewhere?
No Lisp or Prolog person was involved, so the developers used C throughout the project (except SQL for database search).
Design process type: What is the nature of the design process?
Exploratory research on speech synthesis [was this dropped?] and speech recognition in human-machine dialogue.
Development process type: How was the system/component developed?
The development of the system may be described in terms of four phases:
1. The idea was conceived and initial preparations were done.
2. First WOZ experiments were performed with about 30 users.
3. A second series of WOZ experiments was performed with about 40 users.
4. The full system was implemented.
The initial preparations included interviews with 3-4 people from the Waxholm company and a timetable was collected from the company. The developers were not allowed to make recordings in the Waxholm company. The initial system was based on what the 5-6 people in the project could come up with as regards lexicon, grammar, dialogue model etc. During WOZ, text (from the wizard's typing) and speech data were collected running the system with a wizard replacing the speech recognition module. Data were eventually collected from users interacting with the final system. After the first subjects were recorded, both the grammar and the dialogue model could be based on empirical data. The lexicon was expanded and the network probabilities were trained. After the second phase some major revisions were made based on observed interaction problems. No particular development methodology appears to have been followed. A new phase of development was entered when the team believed that "things might work".
Requirements and design specification documentation: Is one or both of these specifications documented?
No. In the absence of such specifications, the developers had no guidance wrt. when or to what extent they would have achieved their development objectives.
Development process representation: Has the development process itself been explicitly represented in some way? How?
Initially the system was designed based on intuition and discussions within the group. At an early stage the block diagram in Figure 2 was made, which has since been presented in many papers. One person had the task of implementing the information flow between modules (represented as boxes in the figure), and the data format was negotiated between the people responsible for the modules to be connected.
The project group had a weekly meeting with a written protocol with numbered tasks: small and big ones, from "fix the loudspeaker" to "improve the recognition". At each meeting the list was gone through and the numbered tasks were taken off when finished. During these meetings design issues played an important role. The protocol is not available.
See also the Waxholm history below. The absence of any explicit development process representation means that Waxholm will be more difficult than needed to re-design and maintain, especially for people who were not involved in its development.
Realism criteria: Will the system/component meet real user needs, will it meet them better, in some sense to be explained, than known alternatives, is the system/component "just" meant for exploring specific possibilities (explain), other (explain)?
Realism and meeting real user needs were desired though not the main goal. The main goal was to explore speech. The Waxholm idea was chosen because it seemed a realistic application and was open-ended in the sense that the domain could easily be extended. The Waxholm travel agents provide person-dependent information (e.g. depending on age). This is not done by the system. It can hardly be claimed that the system will meet user needs better than the Waxholm travel agents. The Waxholm system was only in a loose sense meant to meet real user needs, and relatively little effort was spent on ensuring that it did so. Thus the project had no extended end-user contact, no extensive work on domain delimitation, no clear up-front performance criteria, no final adequacy criteria, no extended quantitative and qualitative evaluation throughout the development process, and no explicit development methodology.
Functionality criteria: Which functionalities should the system/component have (this entry expands the overall design goals)?
The demonstrator application gives information on boat traffic in the Stockholm archipelago, including information about port locations, hotels, camping sites and restaurants. The user may obtain departure and arrival times for boats, have a map displayed with the place of interest, get an overview of lodging and dining possibilities, and get a presentation [a list of names or more than that?] of possible places to visit. This list of Waxholm functionalities represents a subset of the information users may want to have before going on a boat trip in the Stockholm archipelago. The developers themselves found evidence that this subset will not satisfy real users.
Usability criteria: What are the aims in terms of usability?
The system is a walk-up-and-use system which requires no previous training of its users. The system should be able to perform a smooth dialogue with its users. Dialogue "smoothness" was never defined as an operational parameter in the Waxholm project, e.g. in terms of the development methodology, approach to task domain delimitation, system co-operativity parameters, or other evaluation criteria.
Organisational aspects: Will the system/component have to fit into some organisation or other, how?
N/A. Waxholm might have been installed as an independent information booth.
Customer(s): Who is the customer for the system/component (if any)?
There is no customer. The Waxholm company just thought it was fun. They have no intention of pushing the system to a final product robust enough for public use. The company allowed no recordings, but the developers were allowed to interview 3-4 travel agents once. Later in the process the developers contacted the Waxholm company several times to discuss specific issues. The company was not involved in domain delimitation.
Users: Who are the intended users of the system/component?
Users must speak Swedish and are walk-up-and-use users. Walk-up-and-use is an appropriate paradigm for the application.
Developers: How many people took significant part in the development? Did that cause any significant problems, such as time delays, loss of information, other (explain)? Characterise each person who took part in terms of novice/intermediate/expert wrt. developing the system/component in question and in terms of relevant background (e.g., novice phonetician, skilled human factors specialist, intermediate electrical engineer).
Development of the dialogue component was done mainly by two engineers, one of whom had long speech technology experience. Neither of them had formal human factors training. The project did not put any particular emphasis on development for usability.
Development time: When was the system developed? What was the actual development time for the system/component (estimated in person/months)? Was that more or less than planned? Why?
Roughly one and a half person-years were spent on the dialogue component. That was about the time that could be afforded. The Waxholm project was a three-year project involving about 6-8 researchers, but it is hard to say what specifically is Waxholm and what is long-term research on the basic technology. An initial version of the system based on text input has been in operation since September 1992. The project did not put any particular emphasis on dialogue component development.
Requirements and design specification evaluation: Were the requirements and/or design specifications themselves subjected to evaluation in some way, prior to system/component implementation? If so, how?
Since there were no requirement or design specifications, this was not possible.
Evaluation criteria: Which quantitative and qualitative performance measures should the system/component satisfy?
No particular evaluation criteria were set up from the very beginning. The parameters evaluated during the development process were relatively arbitrary and not based on any systematic approach. ** This is far from best practice procedures. The parameters evaluated were:
Number of turns: The number of turns needed to get a specific task completed was measured. The measurements were not related to any performance targets and their significance was not evaluated. [True? Then this is a measurement and not an evaluation.]
Naturalness - mixed initiative dialogue: A natural mixed initiative dialogue, rather than a prompted system-directed one, was given high priority, so the focus came to be more on the naturalness than on reaching the goal as fast as possible. No particular evaluation was made of the success of the mixed initiative dialogue strategy and no performance targets were formulated. [True?]
Naturalness - no interaction problems: During WOZ the developers looked out for interaction problems and problems observed were fixed as far as possible. However, no detailed and systematic analysis of user-system interaction and identification of problems was performed.
Robustness - of topic identification: It was evaluated how robust the topic identification was in the face of erroneous recognition and the like.
Transaction success: Was it measured? Results? How defined?
The above measures were discussed at an early stage and formalised later during the project.
Evaluation: At which stages during design and development was the system/component subjected to testing? How? Describe the results.
Dialogue management: As regards the entire system and in particular the dialogue manager, only data from the Wizard of Oz experiments was evaluated in any detail, see below.
Phases 2 and 3: WOZ scenario-based simulation and progress evaluation: 66 subjects (17 female) each received three scenarios, the first one always being the same (Figure 1). Scenarios were presented both as text on the screen and in synthetic speech. Users were encouraged to also use the system beyond the scope of the scenarios. A total of 14 scenarios were used. The scenarios were meant to cover different aspects of the system's domain; 14 scenarios were what the developers came up with without starting to repeat themselves. Scenario development was not otherwise principled. A problem was that users tended to re-use the vocabulary from the scenarios. Each scenario required that the user solve from one to four subtasks, a subtask consisting of, e.g., requesting a timetable, a map or a list of facilities. Each subtask required specification of several different constraints, such as departure port, destination port and departure day. Subjects had to provide the system with up to ten such constraints, with a mean of 4.3, in order to solve a complete scenario.
After the session subjects filled in a questionnaire with questions about weight, height, age, profession, dialect, speaking habits, native tongue, comments about the experiment, etc. Most subjects were department staff or undergraduate students. The issues addressed in the questionnaire mainly pertain to speech. The questionnaire was not developed in a principled way other than that it contained a few questions considered relevant. The answers to the questionnaire have not been evaluated in detail. Age and position of participants have been extracted from the questionnaires and are shown in graphs presented in papers.
A total of 198 dialogues were recorded and analysed. The dialogues contained 1900 user utterances and 9200 words. The total recording time amounts to 2 hours and 16 minutes. After the first 37 sessions (35 subjects) all system parts went through a major revision. The first phase included approximately 1000 subject utterances. The responses "I do not understand" and "You have to reformulate" occurred in 35.8% of the system responses. In the second phase the dialogue manager was revised as well as the scenarios. In this phase 31 subjects produced 900 utterances. The improved system failed to understand 20.9% of the time. The system responded "I don't understand" 575 times, corresponding to 268 occasions where consecutive repetitions are counted as one occasion. In 50% of the cases the system […].
[…] people address within-domain issues. The vocabulary contains words from adjacent domains in order for the system to recognise such input and therefore give more precise answers, such as "I cannot make hotel reservations".
About 700 utterances are simple answers to system questions while the rest, 1200, can be regarded as user initiatives.
Word recognition accuracy was measured at 76.0% and later improved to 78.6% (laboratory).
The average utterance length was 5.6 words. The average length of the first utterance in the dialogues was 8.8 words. The utterance length distribution shows one maximum at two words and one at five words (corresponding to ellipses versus full sentences). Less than 3% of the utterances contain restarts like repetition of a word or a phrase or changes of a word. About one fourth of the restarts occur in interrupted words, that is, in words that are not phonetically completed.
Most system questions occurred when the system understood that the subject wanted a timetable displayed. If information was missing, the system took the initiative to ask for this information. The subjects answered the system questions in 95.4% of the cases. Thus subjects were cooperative. Only about 1% changed the topic during the system-controlled dialogue.
** [Most or all of the above is just measurements of whatever. If properly organised, we could keep it. But I see little or no evaluation of any kind! Do you agree? Then we could add a comment to that effect?]
Transaction success: The database contains 265 subtasks, about 84% of which were solved by the subjects. In 75% of the cases, 199 out of 265, the subjects had completed a subtask after one to five utterances. The subjects needed about seven utterances to solve one scenario. After the task was completed several subjects continued to ask questions in order to test the system. About three additional utterances per scenario were collected in this way. In 42 cases a scenario had been designed so that it could not be completely solved by a subject, corresponding to an a priori error rate of 21%. In half of these, 21 scenarios, some of the subtasks were solved by the subjects. Can we conclude that transaction success evaluation was not done?
Robustness - of topic identification: The topic identification method has been evaluated by using one quarter of the data, about 300 utterances, as test data, and the rest, about 900 utterances, as training data. This procedure has been repeated for all quarters. The reported results are the mean values from these four runs. The eight possible topics (TIME_TABLE, SHOW_MAP, EXIST, TRIP_MAP, END_SCENARIO, REPEAT, NO_UNDERSTANDING, OUT_OF_DOMAIN) have a rather uneven distribution in the material, with TIME_TABLE occurring 45% of the time. The topic NO_UNDERSTANDING is trained on a set of constructed utterances that are impossible to understand, even for a human. This topic is then used as a model for the system to give an appropriate "no understanding" system response. In principle, these utterances can still have a reasonable parse. However, the topic selection is certainly influenced by a poor parse. Using the unprocessed labelled input transcription yields 12.9% errors. By excluding 55 utterances, about 5% of the test corpus, predicted to be part of the "no understanding" topic, the error is reduced to about 8.8%. When all extralinguistic sounds, about 700, are excluded from the input material, the number of complete parses increases by about 10%. The prediction result (12.7% error, and 8.5% with the "no understanding" utterances excluded) was about the same as in the first experiment. When only utterances giving a complete parse are included, errors are reduced to 3.1% (2.9% with the "no understanding" utterances excluded). It is not known if an increased grammatical coverage will reduce the topic prediction errors. Can we say that they measured the results of the topic identification approach [In 1st WOZ phase? 2nd WOZ?] but that they had no target values to go for, that they did not do any progress evaluation (comparing successive measurements), and that they did not compare the approach to other approaches?
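The quarter-wise procedure described above is a four-fold cross-validation: train on three quarters, test on the remaining quarter, rotate, and average the error rates. A minimal sketch, using an invented toy corpus and a stand-in majority-topic "model" rather than the actual Waxholm topic classifier:

```python
"""Illustrative four-fold cross-validation of a topic classifier, mirroring
the quarter-wise evaluation described above. The corpus and the classifier
(a majority-topic baseline) are stand-ins, not the Waxholm data or model."""

def mean_cv_error(data, k=4):
    """data: list of (utterance, topic) pairs. Returns the mean error
    rate over k train/test rotations."""
    n = len(data)
    errors = []
    for i in range(k):
        lo, hi = i * n // k, (i + 1) * n // k
        test, train = data[lo:hi], data[:lo] + data[hi:]
        # Stand-in "training": predict the most frequent training topic.
        topics = [t for _, t in train]
        majority = max(set(topics), key=topics.count)
        wrong = sum(1 for _, t in test if t != majority)
        errors.append(wrong / len(test))
    return sum(errors) / k

# Toy corpus with a skewed topic distribution, as in the Waxholm material.
corpus = [("u%d" % j, "TIME_TABLE") for j in range(9)] + [("u9", "EXIST")]
print(mean_cv_error(corpus))  # mean error over the four runs (about 0.083)
```

Reporting the mean over all four rotations, as Waxholm did, makes better use of a small corpus than a single train/test split.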
** [We might move the following (Complete parses, Perplexity, Word error rate) to a "reserve" file, inserting any significant measurements in the grid? It is not about dialogue management.]
Complete parses: The parser has been evaluated in several different ways. Most tests used a deleted [deletion?] estimation procedure. In the WOZ experiments 62% of 700 utterances (user answers) gave a complete parse while 48% of 1200 utterances (user initiatives) gave a complete parse. Responses to system questions typically have a very simple syntax. If extralinguistic sounds such as lip smack, sigh and laughing are excluded from the user initiative material, the result is increased to 60% complete parses. Sentences with incomplete parses are handled by the robust parsing component and frequently affect the desired system response.
Perplexity: The perplexity of the Waxholm data is about 26 using a trained grammar. If only utterances with complete parses are considered the perplexity is 23. On an HP 735 it takes about 17 msec to process an utterance.
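For readers unfamiliar with the measure: a perplexity figure such as the 26 quoted above is the inverse geometric mean of the probabilities the language model assigns to the test words. A minimal sketch with made-up probabilities (the Waxholm grammar and data are not reproduced here):

```python
"""Minimal illustration of how a perplexity figure is computed from the
per-word probabilities a trained grammar assigns to test data. The
probabilities below are invented for the example."""

import math

def perplexity(word_probs):
    """word_probs: P(w_i | history) for each word of the test data."""
    n = len(word_probs)
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / n)

# A model assigning probability 1/26 to every word has perplexity 26,
# i.e. it is as uncertain as a uniform choice among 26 words.
print(perplexity([1 / 26] * 100))  # 26.0 (up to floating-point rounding)
```

This is why the drop from 26 to 23 on complete-parse utterances is meaningful: the effective branching factor the recogniser faces is smaller on grammatical input.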
Word error rate: The parser has also been evaluated in an N-best list resorting framework. In total, 290 N-best lists with about 10 alternatives each were generated, using an early version of the speech recognition module. Several of the utterances were answers to simple questions, and the average utterance length was about five words. The top choice using a bigram grammar as part of the recognition module gave a word accuracy of 76.0%. The mean worst and best possible accuracies in the lists were 48% and 86.1%. After resorting with the STINA parser the result improved to 78.6%, corresponding to about 25% of the possible increase.
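The resorting scheme can be sketched as follows. The score combination, the weight, and the example hypotheses are illustrative assumptions; the actual system combined the recogniser's scores with STINA parser scores in its own way.

```python
"""Sketch of N-best list resorting: the recogniser proposes ranked
hypotheses, and a parser-based score is combined with the recogniser score
to pick a new top choice. All scores and weights here are invented."""

def resort_nbest(nbest, parser_score, weight=0.5):
    """nbest: list of (hypothesis, recogniser_score), higher is better.
    parser_score: function mapping a hypothesis to a grammar-based score."""
    rescored = [(hyp, (1 - weight) * rec + weight * parser_score(hyp))
                for hyp, rec in nbest]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[0][0]  # new top choice after resorting

# Toy example: the parser prefers the grammatical second hypothesis.
nbest = [("the boat leave at noon", 0.60),
         ("the boat leaves at noon", 0.58)]
grammatical = {"the boat leaves at noon"}
top = resort_nbest(nbest, lambda h: 1.0 if h in grammatical else 0.0)
print(top)  # the boat leaves at noon
```

The "best possible accuracy in the lists" quoted above is the oracle bound: resorting can never do better than the best hypothesis the recogniser put on the list, which is why the gain is reported as a fraction of the possible increase.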
Re-usability: Were there any specific results on the claimed "partial re-usability"?
Multimodal aspects: Most of the graphics was only added after the last WOZ experiments: the face, speech recognition feedback, tables in different places (earlier all information was displayed in the same place). These additions were made to cope with problems observed in the WOZ experiments. [Describe the problems.] The additions have not been evaluated.
** [I have not tried to evaluate the above yet. We could do that when you have re-worked and reduced it a bit. It seems that they have done very little of interest in dialogue management evaluation!]
Mastery of the development and evaluation process: Of which parts of the process did the team have sufficient mastery in advance? Of which parts didn't it have such mastery?
The main task of the project was to learn and model, which was a success. The project was largely a competence-building exercise.
Problems during development and evaluation: Were there any major problems during development and evaluation? Describe these.
No human problems were encountered, but there were several significant technical problems outside the scope of the project, such as system and hardware problems. Several acoustic problems were also experienced, and system software problems reduced the development speed.
Development and evaluation process sketch: Please summarise in a couple of pages key points of development and evaluation of the system/component. To be done by the developers.
WOZ experiments were performed to collect a database of user utterances. The WOZ experiments spanned about 4 months. Two iterations were made: the first involved about 30 subjects, the second about 40 subjects. Between the iterations, major revisions of all system parts were made on the basis of the collected experiences. On-the-fly modifications were made throughout the experiments. In particular in the beginning, the system modules did not perform very well. A wizard replaced the speech recogniser; early versions of the other system modules were used. The subjects were seated in an anechoic room in front of a display, and a high-quality microphone was used. The wizard was seated in an adjacent room facing two screens, one displaying what was shown to the subject and the other providing system information. The subjects all knew that the wizard replaced the speech recognition. All utterances were recorded at 16 kHz and stored together with their respective label files. A `label file' is a transcribed and annotated portion of the corpus; the label files contain orthographic, phonemic, phonetic and durational information. The MIX annotation standard was used, which may easily be transformed into wav format. All system information was logged. An experimental session started with a system introduction presented as text on the screen. The text was also read by speech synthesis, permitting the subject to adapt to the synthetic voice. The subjects practised the push-to-talk procedure by reading a sound calibration sentence and eight phonetically rich reference sentences.
Waxholm history: Could we annotate this history with the 4 phases noted above? I am a bit confused about the phases, and they are important to, i.a. our presentation of the evaluation efforts made.
91-09 Preparatory phase -> Block diagram (see Figure 2).
91-12 First grammar rule set in parser.
92-07 Network development + Oracle implementation.
93-03 First pilot recording + Recording protocol + Struggling with audio quality.
93-07 Formal start of project in the Language Technology programme.
93-09 Data collection starts (first WOZ). Where is the second WOZ?
A*-search implemented and connected to phoneme neural network.
Lexicon structure, transcription - normative for all modules.
Automatic labelling (MIX standard used for annotation (may easily be transformed to .wav)) - name conventions, non-language symbols.
93-11 All modules including recognition connected.
Data analysis using evaluation software.
Phoneme network trained on Wizard of Oz recordings.
First speaker-independent evaluation.
94-01 Implementation of rule-controlled dialogue.
94-05 Evaluation of collected data:
Recognition performance + Syntax analysis + Dialogue control.
94-07 Improved search method + word network output.
True parse probabilities + improved robust parse.
Labelling and labelling....
94-10 First complete run-time version of the system (video available).
Rule-controlled dialogue... topic prediction.
Recognition accuracy... Which evaluations, if any, were done on the final system?
96-06 End of project
96-10 On display at the technical fair in Stockholm.
Component selection/design: Describe the components and their origins.
All parts of the system were developed in-house. A graphical interface presents the dialogue grammar graphically. Both the syntax and the dialogue networks can be modelled and edited graphically with this tool. There is an interactive development environment (it is possible to study the parsing and the dialogue flow step by step when a graphic tree is being built). It is possible to use log files collected during WOZ as scripts to repeat a specific dialogue, including all graphic displays and acoustic outputs.
A multi-lingual text-to-speech system was modified for the application. A face-synthesis module was built. Both the visual and the speech synthesis are controlled by the same synthesis software.
The STINA parser is based on MIT's TINA parser. ** [Move details to the grid or maybe some of them just to the reserve file, shorten this para. considerably?] The parser runs with two different time scales, corresponding to the words in each utterance and to the turns in the dialogue. Characteristics are a stack-decoding search strategy, a feature-passing mechanism to implement unification, and a robust parsing component. Parsing is done in three steps. The first step makes use of broad categories such as nouns, while the following step expands these into more detailed solutions. The last step involves re-calculation of hypothesis probabilities according to a multi-level N-gram model. The first grammar and lexicon were based on experiences from TINA plus pure guesses.
Graphic feedback showing the system's recognition of the user's spoken input was introduced to deal with a problem observed during WOZ: users often did not get enough feedback to decide whether the system shared their interpretation of the dialogue.
Robustness: How robust is the system/component? How has this been measured? What has been done to ensure robustness?
A lot of work was spent on making the system technically [please clarify] robust. No formal measurements were made; problems were simply fixed as they were noticed. No particular robustness metrics have been adopted or used.
Maintenance: How easy is the system to maintain, cost estimates, etc.
Provided the machine is not changed, maintenance cost is very low. The Waxholm system is no longer under active development but is kept running for demonstrations from time to time. Maintenance costs are kept at a minimum, i.e. a couple of days per year. The project never aimed at design for sustained development and re-design.
Portability: How easily can the system/component be ported?
Adjusting to a new machine would be very costly; it would take a lot of work to port the system. The project never aimed at design for portability.
Modifications: What is required if the system is to be modified?
A modification of the domain implies an addition concerning how to handle a new topic. It is the ambition that the implementation and the training procedures should, as much as possible, be kept the same. [the answer to this entry seems a bit strange? Can we conclude that the aim of partial re-usability was not achieved?]
Additions, customisation: Has a customisation of the system been attempted/carried out (e.g. modification of a part of the vocabulary, new domain/task, etc.)? Has there been an attempt to add another language? How easy is it (how much time/effort) to adapt/customise the system to a new task? Is there a strategy for resource updates? Is there a tool to enforce that the optimal sequence of update steps is followed?
This has not been attempted.
Property rights: Describe the property rights situation for the system/component.
The system/component belongs to KTH.
Documentation of the design process
See Figures 1, 2 and 3 in the grid description.
References to additional project/system/component documentation
Bertenstam, J., Blomberg, M., Carlson, R., Elenius, K., Granström, B., Gustafson, J., Hunnicutt, S., Högberg, J., Lindell, R., Neovius, L., de Serpa-Leitao, A., Nord, L. and Ström, N. (1995): "Spoken dialogue data collection in the Waxholm project", STL-QPSR 1/1995, pp. 50-73.
Carlson, R., Hunnicutt, S. and Gustafson, J. (1995): "Dialogue management in the Waxholm system", Proc. Spoken Dialogue Systems, Vigsø.
Bertenstam, J., Blomberg, M., Carlson, R., Elenius, K., Granström, B., Gustafson, J., Hunnicutt, S., Högberg, J., Lindell, R., Neovius, L., de Serpa-Leitao, A., Nord, L. and Ström, N. (1995): "The Waxholm system - a progress report", Proc. Spoken Dialogue Systems, Vigsø.
Carlson, R. (1996): "The Dialog Component in the Waxholm System", Proc. Twente Workshop on Language Technology (TWLT11): Dialogue Management in Natural Language Systems, University of Twente, the Netherlands.
Carlson, R. and Hunnicutt, S. (1996): "Generic and domain-specific aspects of the Waxholm NLP and Dialog modules", Proc. of ICSLP-96, 4th Intl. Conference on Spoken Language Processing, Philadelphia, USA, Oct 3-6, 1996.
Carlson, R. and Hunnicutt, S. (1995): "The natural language component - STINA", STL-QPSR 1/1995, pp. 29-48.
Personal communication with Rolf Carlson.
Waxholm: Deficiencies in current practice (life-cycle)
The idea behind the following is to synthesise the remarks made above in the life-cycle document. This can be done when we have a stable document after your next iteration, Laila, and refined when we have the final responses from KTH.
Deficiencies: missing, flawed, obsolete or inadequate dialogue engineering practices, procedures, guidelines, methods, tools and supporting theory.
Criteria: general SE conformance, procedural consistency and completeness, development efficiency and standard assumptions on typical performance.
In interpreting the following critical remarks on the Waxholm life-cycle, it must be borne in mind that the system is one of exploratory research into selected aspects of spoken dialogue systems (speech recognition [and speech synthesis?]) and hence cannot be judged by the same criteria as a commercial project.