THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS
TECHNICAL REPORT OF IEICE.

Detection of Back-Channel Feedback Timings Based on a Consistently Tagged Spoken Dialogue Corpus

Tomohiro OHNO, Yuki KAMIYA, and Shigeki MATSUBARA
Graduate School of International Development / Graduate School of Information Science, Nagoya University
Furo-cho, Chikusa-ku, Nagoya-shi, 464-8601 Japan
E-mail: ohno@nagoya-u.jp

Abstract  This paper describes the analysis and detection of back-channel feedback timings, aiming at realizing in-car spoken dialogue systems with high responsiveness. We comprehensively analyzed the characteristics of back-channel feedback timings using an in-car spoken dialogue corpus in which the timings were consistently tagged, and then statistically detected the timings based on the results of the analysis. An experiment using 1,219 dialogue turns, yielding a precision of 82.2% and a recall of 66.1%, has shown the effectiveness of our method.

Key words  corpus, spoken language, dialogue system, tagging

1. Introduction
The timing of back-channel feedback has been studied from various viewpoints [1]-[6]. In this paper, we analyze the characteristics of back-channel feedback timings using an in-car spoken dialogue corpus to which the timings were consistently tagged, and then detect the timings statistically; in an experiment, our method achieved a precision of 82.2% and a recall of 66.1%.

2. Spoken Dialogue Corpus Tagged with Back-Channel Feedback Timings

We used the data constructed in our previous work [7], in which back-channel feedback timings were consistently tagged to driver's utterances in the CIAIR in-car speech corpus [8]. Table 1 shows an example of the tagged data. Each line corresponds to one morpheme or one silent interval and carries the bunsetsu number; the morpheme (or pause) with its pronunciation, lemma, and part of speech; the back-channel tag (1 if a back-channel was judged to occur immediately after the element, 0 otherwise); and the start and end times in seconds. Silent intervals shorter than 200 ms are labeled sp, and those of 200 ms or longer are labeled pause.

Table 1  An example of the tagged corpus ("(Um,) I'd like to buy some clothes -- I wonder if there's a cheap shop somewhere nearby").

  bunsetsu  morpheme / pause                      tag  start  end
  --        sp                                     0   0.000  0.030
  0         (F と) (filler, symbol-general)        0   0.030  0.090
  1         服 (noun-common-general)               0   0.090  0.340
            を (case particle)                     0   0.340  0.520
            sp                                     0   0.520  0.610
  2         買い (verb-general, continuative)      0   0.610  0.850
            たい (auxiliary verb)                  0   0.850  1.080
            ん (nominalizing particle)             0   1.080  1.150
            だ (auxiliary verb, terminal)          0   1.150  1.240
            けど (conjunctive particle)            0   1.240  1.420
  3         どっ (pronoun)                         1   1.420  1.670
            か (adverbial particle)                0   1.670  1.850
  4         近く (noun-common-adverbial)           0   1.850  2.190
            に<H> (case particle)                  0   2.190  2.880
            sp                                     0   2.880  3.080
            pause                                  1   3.080  4.992
  5         安い (adjective, attributive)          0   4.992  5.362
  6         お (prefix)                            0   5.362  5.422
            店 (noun-common-general)               0   5.422  5.652
  7         ある (verb, terminal)                  0   5.652  5.832
            か (final particle)                    0   5.832  5.982
            なあ (final particle)                  0   5.982  6.272

The morphological information was obtained with ChaSen [9] using the UniDic dictionary [10], clause boundaries were detected with CBAP [11], and the time alignment was obtained by forced alignment with Julius [12]. In the tagging work [7], annotators judged the appropriateness of inserting a synthesized back-channel token (generated with HITACHI's HitVoice) at each position. Four annotators tagged the data, and the agreement between annotator A and each of B, C, and D was measured with Cohen's kappa [13]; all values fell within .67 < kappa < .80, a range regarded as usable quality [14], and the simple agreement ratio was 98.5%.
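The per-line records of Table 1 can be handled with a small parser. The following sketch is ours, not part of the paper: it assumes a compact line format in which a "_"-joined morpheme description is followed by the tag, start time, and end time.

```python
from dataclasses import dataclass

@dataclass
class CorpusToken:
    """One morpheme or silence entry of the tagged corpus (format assumed)."""
    surface: str   # surface form, or "sp"/"pause" for silences
    pos: str       # part-of-speech string, "" for silences
    bc_tag: int    # 1 if a back-channel was tagged right after this element
    start: float   # start time [s]
    end: float     # end time [s]

def parse_token(line: str) -> CorpusToken:
    """Parse a line such as '服_フク_服_名詞-普通名詞-一般 0 0.090 0.340'.
    The last three whitespace-separated fields are the tag, start, and end
    times; the rest is the '_'-joined surface/reading/lemma/POS description."""
    head, tag, start, end = line.rsplit(None, 3)
    fields = head.split("_")
    pos = fields[3] if len(fields) > 3 else ""
    return CorpusToken(fields[0], pos, int(tag), float(start), float(end))
```

For example, `parse_token("pause_pause_pause 1 3.080 4.992").bc_tag` is 1, corresponding to the back-channel after the long pause in Table 1.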
Table 2  Statistics of the tagged corpus and of the analysis data. The whole tagged corpus comprises 346 dialogues and 5,416 back-channel tags (the remaining counts in this row: 11,181; 14,643; 43,723; 94,030; 19,142); the analysis data comprise 1,219 dialogue turns and 546 back-channel tags (remaining counts: 35; 1,421; 4,507; 9,881; 1,813).

Table 3  Back-channel occurrence rate per category (row labels not recoverable): 21.29% (329/1,545); 3.77% (21/557); 9.61% (27/281); 12.81% (31/242); 1.71% (3/175); 36.59% (60/164); 5.45% (6/110); 11.59% (8/69); 15.91% (7/44); 21.62% (8/37); 2.70% (1/37).

Table 4  Back-channel occurrence rate per category (row labels not recoverable): 2.05% (7/342); 10.94% (14/128); 97.60% (122/125); 10.53% (12/114); 0.00% (0/67); 11.11% (7/63); 10.53% (4/38); 94.29% (33/35); 82.35% (28/34); 35.71% (10/28).

3. Analysis of Back-Channel Feedback Timings

Using the tagged data [7], we analyzed where back-channel feedback occurs. Over the analysis data as a whole, back-channels were tagged at 546 of 11,694 candidate positions (4.7%).

3.1 Bunsetsu boundaries
At bunsetsu boundaries b_i / b_(i+1) followed by a silence (sp or pause), back-channels occurred at 504 of 3,288 positions (15.3%), well above the overall rate of 4.7% (see also Table 3).

3.2 Clause boundaries
At clause boundaries c_i / c_(i+1), the rate rises further: 283 of 1,082 positions (26.2%), compared with 15.3% at bunsetsu boundaries (see also Table 4).

Fig. 2  Back-channel occurrence rate (0-40%) and frequency (0-700) by the duration of the sp (30-190 msec).

3.3 Silence duration
When the candidate position is immediately after an sp (a silence shorter than 200 ms), back-channels occurred at 20.2% (241/1,196) of the positions, far above the overall 4.7%; as Fig. 2 shows, the occurrence rate also varies with the duration of the sp.
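The analysis behind Fig. 2 (back-channel rate as a function of sp duration) amounts to binning the sp durations and computing the occurrence rate per bin. A minimal sketch, with hypothetical bin edges (the paper's exact binning was not recoverable):

```python
from collections import defaultdict

def rate_by_duration(samples, bin_ms=20, lo=30, hi=200):
    """samples: iterable of (sp_duration_ms, back_channel_followed) pairs.
    Returns {bin_lower_edge_ms: (occurrence_rate, n_samples)} for bins of
    width bin_ms covering [lo, hi), in the style of the Fig. 2 histogram."""
    acc = defaultdict(lambda: [0, 0])          # bin edge -> [hits, total]
    for dur, bc in samples:
        if lo <= dur < hi:
            edge = lo + int((dur - lo) // bin_ms) * bin_ms
            acc[edge][0] += int(bc)
            acc[edge][1] += 1
    return {e: (hits / n, n) for e, (hits, n) in sorted(acc.items())}
```

Durations outside [lo, hi) are skipped, mirroring the fact that sp labels cover only silences shorter than 200 ms.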
Fig. 3  Back-channel occurrence rate (0-25%) and frequency (0-1,400) by the difference from the speaker-based average speech rate [morae/sec] (range: -9 to 0).
Fig. 4  Back-channel occurrence rate (0-25%) and frequency (0-1,400) by the difference from the mora-count-based average speech rate [morae/sec] (range: -8 to 0).

3.4 Speech rate
We compared the local speech rate (morae/sec) with the average speech rate. When the local rate was at or above the speaker-based average, back-channels occurred at only 1.4% (88/6,132) of the positions, against 10.6% (458/4,343) when it was below the average; with the mora-count-based average, the figures were 1.5% (100/6,479) and 11.2% (446/3,996). Figures 3 and 4 plot the occurrence rate against the difference from the two averages.

Table 5  Back-channel occurrence rate per prosodic-pattern pair (row labels not recoverable): 5.68% (189/3,326); 6.22% (56/901); 6.00% (32/533); 6.35% (31/488); 5.45% (22/404); 6.29% (21/334); 3.67% (8/218); 12.15% (22/181); 13.19% (19/144); 6.36% (7/110).

3.5 Prosodic patterns
Prosody has been shown to cue back-channels: [2] examined the final 100 ms region of an utterance, and [5] used prosodic features over 100 ms regions (frame length 50 ms, shift 25 ms). Following these, we classified the pitch contour just before each candidate position into three patterns and examined pairs of patterns (the "-"-joined combinations of Tables 5 and 6, which each list ten pattern pairs). F0 was extracted with Praat [15] at 5 ms intervals.
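The speech-rate comparison of Sect. 3.4 boils down to the difference between the local articulation rate and an average rate, in morae per second. A minimal sketch (the function names are ours):

```python
def speech_rate(n_morae: int, start: float, end: float) -> float:
    """Local articulation rate in morae per second over [start, end]."""
    return n_morae / (end - start)

def rate_diff(n_morae: int, start: float, end: float, avg_rate: float) -> float:
    """Difference between the local speech rate and an average rate
    [morae/sec]. Negative values mean slower-than-average speech, the
    region where Sect. 3.4 finds back-channels to be far more frequent
    (10.6% vs. 1.4% with the speaker-based average)."""
    return speech_rate(n_morae, start, end) - avg_rate
```

For instance, 10 morae spoken over 2 seconds against an average of 7 morae/sec gives a difference of -2 morae/sec, which would fall in the slower-than-average region of Figs. 3 and 4.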
Table 6  Back-channel occurrence rate per prosodic-pattern pair (row labels not recoverable): 4.49% (72/1,602); 9.45% (151/1,598); 11.85% (139/1,173); 1.44% (12/836); 9.87% (74/750); 2.33% (14/601); 1.67% (8/478); 0.87% (4/460); 2.08% (8/384); 1.31% (4/305).

4. Detection of Back-Channel Feedback Timings

We formulate detection as binary classification: an utterance is regarded as a morpheme sequence m_1 ... m_n, and for each morpheme m_i a Support Vector Machine (SVM) decides whether a back-channel should occur immediately after m_i. Table 7 lists the features.

Table 7  Features used for a morpheme m_i (some feature descriptions did not survive; m_j ranges over morphemes around m_i):
 1.-5.  lexical features of m_i and the surrounding morphemes m_j;
 6.  whether m_i is immediately followed by an sp;
 7.  the duration α of that sp [s], quantized into four classes: α <= 0.1; 0.1 < α <= 0.17; 0.17 < α < 0.2; α = 0.2;
 8.  whether m_i is immediately followed by a pause;
 9.-10.  the speech rate β (speaker-based; cf. Sect. 3.4) [morae/sec], quantized into three classes: β < 2; 2 <= β < 6; 6 <= β;
 11.-12.  the speech rate γ (mora-count-based) [morae/sec], quantized into three classes: γ < 2; 2 <= γ < 6; 6 <= γ;
 13.  a duration δ for m_i [s], quantized into four classes: δ <= 0.6; 0.6 < δ <= 1.4; 1.4 < δ <= 2.9; 2.9 < δ;
 14.-15.  the prosodic-pattern features of Sect. 3.5.

5. Experiment

5.1 Outline
Following [7], we used 1,219 dialogue turns (9,962 morphemes) as the experimental data, and LibSVM [16] as the SVM implementation. Besides the consistent tags of Sect. 2, we also evaluated against the individual tags of annotators B, C, and D.

Table 8  Detection results.
          precision          recall             F-measure
  all     82.2% (361/439)    66.1% (361/546)    73.3
  B       74.0% (168/227)    78.9% (168/213)    76.4
  C       74.4% (157/211)    73.7% (157/213)    74.0
  D       79.0% (162/205)    76.1% (162/213)    77.5

5.2 Discussion
Table 8 summarizes the results. The kappa agreement between the detected and the tagged timings was 0.728, comparable to the agreement between annotator A and annotators B, C, and D reported in Sect. 2 (0.755, 0.727, and 0.763, respectively). For 1,023 positions [...] the rate was 83.9%, and for 349 positions [...] it was 48.1% (168/349). Figures 5 and 6 show correctly detected examples, while Figures 7 and 8 show a false detection and a missed detection, respectively, both at positions involving sp or pause.
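The precision, recall, and F-measure figures of Table 8 follow directly from the raw counts. A small helper (our own, not from the paper) reproduces the reported values:

```python
def prf(correct: int, detected: int, tagged: int):
    """Precision, recall (in %) and F-measure from raw counts:
    correct  -- detected timings that match a tagged back-channel timing
    detected -- all timings output by the detector
    tagged   -- all back-channel timings in the reference data."""
    p = 100.0 * correct / detected
    r = 100.0 * correct / tagged
    f = 2 * p * r / (p + r)          # harmonic mean of precision and recall
    return round(p, 1), round(r, 1), round(f, 1)
```

For the overall row of Table 8, `prf(361, 439, 546)` gives (82.2, 66.1, 73.3), matching the reported precision, recall, and F-measure.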
Fig. 5  A correctly detected example. Base intervals: (F えっと) 年賀状のはがき sp 買いたいから郵便局に行きたいんだけど近くで sp pause いちばん近いとこで sp pause 郵便局あるかなあ ("(Um,) I want to buy New Year's postcards, so I'd like to go to a post office -- I wonder if there's a post office at the closest place nearby"). Reference: 1 1 1 1 / detected: 1 1 1 1.

Fig. 6  A correctly detected example. Base intervals: (F んー) 雰囲気のいいお店がいいけど sp pause どっちがいいかな ("(Hmm,) a shop with a nice atmosphere would be good, but which one should I pick"). Reference: 1 / detected: 1.

Fig. 7  A false-detection example (error type 1). Base intervals: すしのこうずしっていうの sp pause を sp pause お sp pause 願いします ("The sushi place called Kozushi -- that one, please"). Reference: 1 / detected: 1 1.

Fig. 8  A missed-detection example (error type 2). Base intervals: (F えーと) sp ファーストフード sp みたいな sp お店 sp どっか sp あるかなあ ("(Um,) is there a fast-food kind of place somewhere?"). Reference: 1 / detected: none.

5.3 Error analysis
Of the 78 false detections (439 - 361), 17 [...] and 61 occurred at positions immediately followed by an sp or pause, 39 of which [...] (Fig. 7 shows such a case). Of the 185 missed detections (546 - 361), 17 [...] and 168 [...], 103 of which were at positions followed by an sp or pause (Fig. 8 shows such a case).

6. Conclusion

This paper described the analysis and detection of back-channel feedback timings based on an in-car spoken dialogue corpus in which the timings were consistently tagged. The proposed method detected back-channel timings with a precision of 82.2%, a recall of 66.1%, and an F-measure of 73.3%.

Acknowledgment  This research was partially supported by a Grant-in-Aid for Scientific Research (No. 21650028).

References
[1] N. Cathcart, J. Carletta, and E. Klein, "A shallow model of backchannel continuers in spoken dialogue," Proc. 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003), pp. 51-58, 2003.
[2] (reference in Japanese; details not recoverable), 1997.
[3] S.K. Maynard, Japanese Conversation: Self-Contextualization through Structure and Interactional Management, Ablex, 1989.
[4] (reference in Japanese; details not recoverable), pp. 261-279, 1984.
[5] N. Kitaoka, M. Takeuchi, R. Nishimura, and S. Nakagawa, "Response timing detection using prosodic and linguistic information for human-friendly spoken dialog systems," Journal of the Japanese Society for Artificial Intelligence, vol. 20, no. 3, pp. 220-228, 2005.
[6] N. Ward and W. Tsukahara, "Prosodic features which cue back-channel responses in English and Japanese," Journal of Pragmatics, vol. 32, pp. 1177-1207, 2000.
[7] Y. Kamiya, T. Ohno, and S. Matsubara, "Coherent back-channel feedback tagging of in-car spoken dialogue corpus," Proc. 11th Annual SIGdial Meeting on Discourse and Dialogue (SIGDIAL 2010), pp. 205-208, 2010.
[8] N. Kawaguchi, S. Matsubara, K. Takeda, and F. Itakura, "CIAIR in-car speech corpus: influence of driving status," IEICE Transactions on Information and Systems, vol. E88-D, no. 3, pp. 578-582, 2005.
[9] Y. Matsumoto, A. Kitauchi, T. Yamashita, and Y. Hirano, "Japanese morphological analysis system ChaSen version 2.0 manual," NAIST Technical Report, NAIST-IS-TR99009, 1999.
[10] (reference in Japanese on the UniDic dictionary; details not recoverable), vol. 22, pp. 101-122, 2007.
[11] (reference in Japanese on the CBAP clause-boundary detection program; details not recoverable), vol. 11, no. 3, pp. 39-68, 2004.
[12] A. Lee, T. Kawahara, and K. Shikano, "Julius -- an open source real-time large vocabulary recognition engine," Proc. 7th European Conference on Speech Communication and Technology (EUROSPEECH 2001), pp. 1691-1694, 2001.
[13] J. Cohen, "A coefficient of agreement for nominal scales," Educational and Psychological Measurement, vol. 20, pp. 37-46, 1960.
[14] J. Carletta, "Assessing agreement on classification tasks: the kappa statistic," Computational Linguistics, vol. 22, no. 2, pp. 249-254, 1996.
[15] P. Boersma and D. Weenink, "Praat: doing phonetics by computer (version 5.1.05)," 2009. Software available at http://www.praat.org/.
[16] C.-C. Chang and C.-J. Lin, "LIBSVM: a library for support vector machines," 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.