Show simple item record

dc.contributor.author    Mamun, Md. Adyelullahil
dc.contributor.author    Abdullah, Hasnat Md.
dc.contributor.author    Alam, Md. Golam Rabiul
dc.contributor.author    Hassan, Muhammad Mehedi
dc.contributor.author    Uddin, Md Zia
dc.date.accessioned      2024-05-29T12:39:14Z
dc.date.available        2024-05-29T12:39:14Z
dc.date.created          2023-03-17T14:40:44Z
dc.date.issued           2023
dc.identifier.citation   Multimedia Tools and Applications. 2023, 82, 35059-35090.  en_US
dc.identifier.issn       1380-7501
dc.identifier.uri        https://hdl.handle.net/11250/3131886
dc.description.abstract  Human conversational style is characterized by sense of humor, personality, and tone of voice. These characteristics have become essential for conversational intelligent virtual assistants. However, most state-of-the-art intelligent virtual assistants (IVAs) fail to interpret the affective semantics of human voices. This research proposes an anthropomorphic intelligent system that can hold a proper human-like conversation with emotion and personality. A voice style transfer method is also proposed to map the attributes of a specific emotion. Initially, the temporal audio waveform is converted into frequency-domain data (a Mel-spectrogram), which comprises discrete patterns for audio features such as notes, pitch, rhythm, and melody. A collateral CNN-Transformer-Encoder is used to predict seven different affective states from the voice. The voice is also fed in parallel to DeepSpeech, an RNN model that generates the text transcription from the spectrogram. The transcribed text is then passed to a multi-domain conversational agent that uses blended skill talk, a transformer-based retrieve-and-generate strategy, and beam-search decoding to produce an appropriate textual response. For voice synthesis and style transfer, the system learns an invertible mapping of data to a latent space that can be manipulated, and generates each Mel-spectrogram frame conditioned on the previous frames. Finally, the waveform is generated from the spectrogram using WaveGlow. The outcomes of the studies conducted on the individual models were promising. Furthermore, users who interacted with the system provided positive feedback, demonstrating the system’s effectiveness.  en_US
dc.language.iso          eng  en_US
dc.publisher             Springer  en_US
dc.rights                Navngivelse 4.0 Internasjonal
dc.rights.uri            http://creativecommons.org/licenses/by/4.0/deed.no
dc.title                 Affective social anthropomorphic intelligent system  en_US
dc.type                  Peer reviewed  en_US
dc.type                  Journal article  en_US
dc.description.version   publishedVersion  en_US
dc.rights.holder         © The Author(s) 2023  en_US
dc.source.pagenumber     35059-35090  en_US
dc.source.volume         82  en_US
dc.source.journal        Multimedia Tools and Applications  en_US
dc.identifier.doi        10.1007/s11042-023-14597-6
dc.identifier.cristin    2134863
cristin.ispublished      true
cristin.fulltext         original
cristin.qualitycode      1



