Please use this identifier to cite or link to this item: http://bura.brunel.ac.uk/handle/2438/28249
Full metadata record
DC Field	Value	Language
dc.contributor.author	Zhang, S	-
dc.contributor.author	Chen, Y	-
dc.contributor.author	Sun, Y	-
dc.contributor.author	Wang, F	-
dc.contributor.author	Shi, H	-
dc.contributor.author	Wang, H	-
dc.date.accessioned	2024-02-07T18:22:47Z	-
dc.date.available	2024-02-07T18:22:47Z	-
dc.date.issued	2023-12-26	-
dc.identifier	ORCID iD: Siyu Zhang https://orcid.org/0000-0002-0001-0204	-
dc.identifier	ORCID iD: Yaoru Sun https://orcid.org/0000-0002-2179-0713	-
dc.identifier	ORCID iD: Fang Wang https://orcid.org/0000-0003-1987-9150	-
dc.identifier	arXiv:2307.14142v1 [cs.CV]	-
dc.identifier.citation	Zhang, S. et al. (2023) 'LOIS: Looking Out of Instance Semantics for Visual Question Answering', IEEE Transactions on Multimedia, 0 (early access), pp. 1 - 13. doi: 10.1109/TMM.2023.3347093.	en_US
dc.identifier.issn	1520-9210	-
dc.identifier.uri	https://bura.brunel.ac.uk/handle/2438/28249	-
dc.description	The file archived on this institutional repository is a preprint available at arXiv:2307.14142v1 [cs.CV] (https://doi.org/10.48550/arXiv.2307.14142). It has not been certified by peer review. You are advised to use the peer-reviewed version published by IEEE at https://doi.org/10.1109/TMM.2023.3347093.	-
dc.description.abstract	Visual question answering (VQA) has been intensively studied as a multimodal task, requiring efforts to bridge vision and language for correct answer inference. Recent attempts have developed various attention-based modules for solving VQA tasks. However, the performance of model inference is largely bottlenecked by visual semantic comprehension. Most existing detection methods rely on bounding boxes, and it remains a serious challenge for VQA models to comprehend and correctly infer the causal nexus of contextual object semantics in images. To this end, we propose a finer model framework without bounding boxes, termed Looking Out of Instance Semantics (LOIS), to address this crucial issue. LOIS can achieve more fine-grained feature descriptions to generate visual facts. Furthermore, to overcome the label ambiguity caused by instance masks, two types of relation attention modules, 1) intra-modality and 2) inter-modality, are devised to infer the correct answers from different visual features. Specifically, we implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information. In addition, our proposed attention model can further analyze salient image regions by focusing on important word-related questions. Experimental results on four benchmark VQA datasets prove that our proposed method has favorable performance in improving visual reasoning capability. Our code is available on GitHub (https://github.com/ArcherCYM/LOIS). (An illustrative sketch of such an inter-modality attention step follows this record.)	en_US
dc.format.extent	1 - 13	-
dc.format.medium	Print-Electronic	-
dc.language.iso	en_US	en_US
dc.publisher	IEEE	en_US
dc.relation.uri	https://arxiv.org/abs/2307.14142	-
dc.rights	Copyright © 2023 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works (see: https://journals.ieeeauthorcenter.ieee.org/become-an-ieee-journal-author/publishing-ethics/guidelines-and-policies/post-publication-policies/).	-
dc.rights.uri	https://journals.ieeeauthorcenter.ieee.org/become-an-ieee-journal-author/publishing-ethics/guidelines-and-policies/post-publication-policies/	-
dc.subject	visual question answering (VQA)	en_US
dc.subject	instance semantics	en_US
dc.subject	visual features	en_US
dc.subject	multimodal relation attention	en_US
dc.title	LOIS: Looking Out of Instance Semantics for Visual Question Answering	en_US
dc.type	Article	en_US
dc.identifier.doi	https://doi.org/10.1109/TMM.2023.3347093	-
dc.relation.isPartOf	IEEE Transactions on Multimedia	-
pubs.publication-status	Published	-
pubs.volume	0	-
dc.identifier.eissn	1941-0077	-
dc.rights.holder	IEEE	-
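
The abstract above describes two relation attention modules, intra-modality and inter-modality, operating over mask-pooled instance features. As a rough orientation only, below is a minimal sketch of a generic inter-modality (question-to-visual) attention step of that kind; all class names, shapes, and dimensions are assumptions for illustration and are not taken from the authors' implementation (see https://github.com/ArcherCYM/LOIS for the actual code).

import torch
import torch.nn as nn
import torch.nn.functional as F

class InterModalityAttention(nn.Module):
    # Hypothetical sketch of a question-guided attention step over
    # mask-pooled instance features, in the spirit of the inter-modality
    # relation attention the abstract describes. Not the LOIS code.
    def __init__(self, dim=512):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # projects question word features
        self.k_proj = nn.Linear(dim, dim)  # projects visual instance features
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, words, instances):
        # words:     (B, M, dim) question word embeddings
        # instances: (B, N, dim) instance features pooled under segmentation masks
        q = self.q_proj(words)
        k = self.k_proj(instances)
        v = self.v_proj(instances)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, M, N)
        return attn @ v  # word-conditioned visual features, (B, M, dim)

# Example: 10 question tokens attending over 36 instance features.
layer = InterModalityAttention(dim=512)
out = layer(torch.randn(2, 10, 512), torch.randn(2, 36, 512))
print(out.shape)  # torch.Size([2, 10, 512])

An intra-modality module would apply the same pattern with queries and keys drawn from a single modality (e.g., instance features attending to other instance and background features), which is one plausible reading of how relations between instance objects and background information could be modeled.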
Appears in Collections: Dept of Computer Science Research Papers

Files in This Item:
File	Description	Size	Format
FullText.pdf	Copyright © 2023 IEEE (full rights statement as in dc.rights above)	5.03 MB	Adobe PDF


Items in BURA are protected by copyright, with all rights reserved, unless otherwise indicated.