Research in the SRO-NICE is directed at improving human-computer interaction. The major focus of the research is on improving the efficiency of systems, and their appreciation by users, by enhancing the intelligence of the systems so that they come to understand the user's intentions by processing a wide range of input modalities. By taking the context of use into account, the systems can present their reactions in the most appropriate form. SRO-NICE research deals with many issues covering methodology, technology, and human factors.
The Human Media Interaction (HMI) group and the Database (DB) group are two participants in the SRO-NICE with a strong joint background in multimedia retrieval. The QAVID project -- QAVID stands for "Question answering from instruction videos" -- is a collaborative research project on multimodal question answering, a broad project involving many subfields of computer science, such as multimedia indexing and annotation, multimedia databases, XML and MPEG-7, question answering, and human-computer interaction.
The project's goal is to bring video annotation and retrieval techniques to non-professional users by adapting techniques that were developed for video retrieval from news broadcasts to video retrieval from instruction videos. We envision a cooking scenario in which the user, while working in the kitchen, may ask a search system questions such as "How do I poach this egg?", "How do I make a Dutch pea soup?", "How do I cut this onion without crying?", or "How do I use this?" (holding a passion fruit). To enable such "how-to" questions in instruction-video search scenarios like the cooking scenario, we envision a system that performs extensive annotation of the videos at indexing time using standard in-house software: a large-vocabulary speech recognition system, a shot detector, low-level feature detectors for query-by-example on key frames, high-level feature detectors such as a face detector, and so on. In such a system, a how-to question might be translated into a structured query that combines information from several so-called stand-off annotations of the instruction videos.
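To make the idea of combining stand-off annotations concrete, the following is a minimal sketch (not the project's actual implementation) of how a how-to question, reduced to keywords, might be matched against two annotation layers: speech-transcript segments and shot boundaries, each stored separately from the video and anchored to it only by timecodes. All data, layer names, and the `answer_how_to` function are hypothetical illustrations.

```python
from dataclasses import dataclass

@dataclass
class Annotation:
    layer: str    # annotation layer, e.g. "speech" or "shot" (hypothetical names)
    start: float  # start time in seconds, relative to the video
    end: float    # end time in seconds
    value: str    # transcript text, shot id, detected label, etc.

# Toy stand-off annotations for one instruction video (invented example data).
ANNOTATIONS = [
    Annotation("speech", 0.0, 8.0, "first bring the water to a gentle simmer"),
    Annotation("speech", 8.0, 15.0, "now crack the egg and slide it in to poach it"),
    Annotation("shot", 0.0, 7.5, "shot_1"),
    Annotation("shot", 7.5, 16.0, "shot_2"),
]

def answer_how_to(keywords, annotations):
    """Return shots whose time span overlaps a speech segment matching all keywords.

    This stands in for the structured query that combines layers: the speech
    layer supplies the textual match, the shot layer supplies the video
    fragment to play back as the answer.
    """
    speech_hits = [a for a in annotations
                   if a.layer == "speech" and all(k in a.value for k in keywords)]
    shots = [a for a in annotations if a.layer == "shot"]
    results = []
    for hit in speech_hits:
        for shot in shots:
            # Temporal overlap test joins the two independent annotation layers.
            if shot.start < hit.end and hit.start < shot.end:
                results.append((shot.value, shot.start, shot.end))
    return results

# "How do I poach this egg?" reduced to content keywords.
print(answer_how_to(["egg", "poach"], ANNOTATIONS))  # → [('shot_2', 7.5, 16.0)]
```

In a real system the join would run over MPEG-7/XML annotation documents in a multimedia database rather than in-memory lists, but the principle is the same: each analysis component writes its own time-anchored layer, and queries combine layers by temporal overlap.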