Measuring the quality of content and the audience engagement it
generates is a key performance indicator in marketing, media and
journalism. To have an impact on its audience, content has to raise
the audience's engagement level. Affective Computing (AC), in which
machines detect and interpret human affective states such as
emotions, interests and behavior, has been a popular area of research
for several years. It is assumed that, to become more user-friendly
and effective, systems need to become sensitive to human emotions.
Nonverbal information is considered to complement the verbal
message, providing a better
interpretation of the message. It is claimed that 70–90% of
communication between humans is nonverbal. The studies conducted
by Albert Mehrabian in 1967 established the 7%–38%–55% rule,
also known as the “3V rule”: 7% of the communication is verbal,
38% of the communication is vocal and 55% of the communication is
visual [1].
The first studies in the field of emotion detection date from the
1960s and 1970s. The most prominent example is that of mood
rings [2]. The principle is simple: the rings contain thermotropic
liquid crystals that react to body temperature. When a person is
stressed, their mood ring takes on a darker color.
The scientific publications of Rosalind Picard (MIT) have
driven great progress in this field since the nineties [3, 4]. She is
one of the pioneers of affective computing. In her book “Affective
Computing”, Picard proposed that emotion can be modeled using a
nonlinear sigmoid function. Over the last 20 years, the development
of technology has enabled relatively effective and marketable
systems in areas such as ambient intelligence (AMI),
virtual reality (VR) and augmented reality (AR).
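The sigmoid-shaped emotional response mentioned above can be sketched as follows. This is a minimal illustration, not Picard's exact formulation: the function name and the parameters gain, midpoint and steepness, together with their default values, are assumptions made for this example.

```python
import math


def sigmoid_response(stimulus, gain=1.0, midpoint=0.0, steepness=1.0):
    """Map a stimulus intensity to a response between 0 and `gain`.

    All parameter names are hypothetical: `midpoint` is the stimulus
    value giving half of the maximum response, and `steepness` controls
    how quickly the response saturates.
    """
    z = (stimulus - midpoint) / steepness
    # Numerically stable evaluation of gain / (1 + exp(-z)).
    if z >= 0:
        return gain / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return gain * ez / (1.0 + ez)
```

Weak stimuli produce almost no response, the response grows fastest around the midpoint, and strong stimuli saturate at the gain, which is the qualitative behavior a sigmoid model is meant to capture.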
Nowadays, in the automotive field for example, an on-board
computer that is able to detect confusion, interest or fatigue can
increase safety. AutoEmotive (MIT Media Lab) is a prototype
equipped with sensors and a camera placed on the steering wheel [5].
This vehicle measures the driver's level of stress and fatigue.
When the need arises, it plays background music, adjusts the
temperature and lighting in the vehicle interior, or proposes a
less stressful route.
Multimodal systems are widely adopted, and several multimodal
datasets include sentiment annotations. Zadeh et al. introduced the
first multimodal dataset (MOSI) with opinion-level sentiment
intensity annotations and studied the prototypical interaction
patterns between facial gestures and spoken words when inferring
sentiment intensity. They proposed a multimodal dictionary based on
a language-gesture study and used it in a speaker-independent model
for sentiment intensity prediction [6]. Other examples of datasets
are ICT-MMMO [7] and MOUD [8]. In [9], intra-modality dynamics are
modeled through three Modality Embedding Subnetworks, for the
language, visual and acoustic modalities, respectively. An LSTM-based
network that extracts contextual features from the video for
multimodal sentiment analysis is presented in [10]. A multimodal
sentiment analysis framework, which includes sets of relevant
features for text and visual data as well as a simple technique for
fusing the features extracted from different modalities, is proposed
in [11].
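One simple way to fuse features from different modalities is feature-level concatenation. The sketch below is only an illustration under stated assumptions: it is not claimed to be the technique used in [11], and the function names, the modality names and the choice of L2 normalization are all hypothetical.

```python
import math


def l2_normalize(vector):
    """Scale a feature vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def fuse_features(modalities):
    """Concatenate normalized per-modality feature vectors into one vector.

    `modalities` maps a modality name (e.g. "text", "visual") to its
    feature vector. Names are sorted so the layout of the fused vector
    is the same for every sample.
    """
    fused = []
    for name in sorted(modalities):
        fused.extend(l2_normalize(modalities[name]))
    return fused


fused = fuse_features({"text": [3.0, 4.0], "visual": [0.0, 2.0]})
```

Normalizing each modality before concatenation keeps one modality with large raw values from dominating the fused vector; the fused vector can then be passed to any standard classifier.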
Multimodal emotion analysis faces the following challenges: (1)
modeling the interactions between language, visual and acoustic
behaviors that change the perception of the expressed emotion
(named the inter-modality dynamics); and (2) efficiently exploring
emotion within each individual modality (named the intra-modality
dynamics), which is difficult because the modalities are highly
expressive (e.g., spoken language, where proper language structure is
often ignored, and video and acoustic modalities, which are expressed
through both space and time).
Emotion analysis alone lacks the ability to measure the engagement
between the user and the content, the interaction with the user that
influences the user's decisions, and the ability to keep the user in
front of the content. This paper presents a new model that measures
the emotional triggers in user behavior and the engagement level of
the user. It also demonstrates a technique to personalize the content
and introduces a metric to measure engagement. The rest of the paper
is structured as follows: Section II presents a review of emotion and
sentiment analysis. Section III presents the proposed model. Section
IV describes the experiment and the performance evaluation. Finally,
Section V concludes the paper.
II. EMOTION AND SENTIMENT ANALYSIS
“Sentiment analysis is the field of study that analyses people’s
opinions, sentiments, evaluations, appraisals, attitudes, and emotions
toward entities such as products, services, organizations, and their
attributes. It represents a large problem space. There are also many
names and slightly different tasks, e.g., sentiment analysis, opinion
mining, opinion extraction, sentiment mining, subjectivity analysis,
affect analysis, emotion analysis, review mining, etc.” [12]
Sentiment Analysis (SA) [13] is a computational study of how
opinions, attitudes, emotions and perspectives are expressed in
language. Sentiment Detection, or in its simplified form, Polarity
Classification, is a tedious and complex task. Contextual changes in
polarity-indicating words, caused for example by negation or sarcasm,
as well as weak syntactic structures make it difficult for both
machines and humans to reliably determine the polarity of messages.
Sentiment analysis methods involve building a system to collect
and categorize opinions about a product. This consists of examining
natural language conversations happening around a certain product
for tracking the mood of the public. The analysis is performed on
large collections of texts, including web pages, online news, Internet
discussion groups, online reviews, web blogs, and social media.
Opinion Mining aims to determine polarity and intensity of a given
text, i.e., whether it is positive, negative, or neutral and to what
extent. To classify the intensity of opinions, we can use methods
introduced in [14, 15, 16, 17].
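Polarity and intensity classification can be illustrated with a minimal lexicon-based sketch. This is not one of the methods of [14, 15, 16, 17]: the tiny lexicon, its weights, the negation list and the function names are all hypothetical, and negation only flips the word immediately following it.

```python
# Hypothetical toy lexicon: word -> signed intensity.
LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "terrible": -2.0}
NEGATIONS = {"not", "never", "no"}


def polarity_score(text):
    """Sum the signed intensities of known words, flipping a word that
    directly follows a negation."""
    score = 0.0
    negate = False
    for token in text.lower().split():
        word = token.strip(".,!?;:")
        if word in NEGATIONS:
            negate = True
            continue
        if word in LEXICON:
            value = LEXICON[word]
            score += -value if negate else value
        negate = False  # negation only reaches the next word
    return score


def classify(text):
    """Return (polarity label, intensity) for a text."""
    score = polarity_score(text)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return label, abs(score)
```

The sign of the score gives the polarity and its magnitude gives the intensity. A sketch this simple cannot handle sarcasm or long-range negation, which is precisely why polarity classification is considered a complex task.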
Text Mining and Social Network Analysis have become a
necessity for analyzing not only information but also the connections
between pieces of information. The main objective is to identify the
necessary information as efficiently as possible, finding the
relationships between available information by applying algorithmic,
statistical, and data management methods to the knowledge. The
automation of
sentiment detection on these social networks has gained attention for
various purposes [18, 19, 20, 21].
The aim of [22] was to report on the associations between
depression severity and the variability (time-unstructured) and
instability (time-structured) in emotion word expression on Facebook
and Twitter across status updates. Several works on depression
based on social networks have emerged, using Twitter [23, 24] and
Facebook [25, 26].
Several authors have been interested in the use of emoticons to
complement sentiment analysis. The authors in [27] use the Twitter
API to collect training data that contain emoticons such as :) and
:(. They use these emoticons as noisy labels: tweets with :) are
treated as positive training data and tweets with :( are treated as
negative training data. In [28], the authors present ESLAM (Emoticon
Smoothed LAnguage Models), which combines fully supervised
methods and distantly supervised methods. Although many TSA
(Twitter Sentiment Analysis) methods have been presented, the
authors in [29] specifically explored the influence of emoticons on TSA.
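The use of emoticons as noisy labels can be sketched as follows. This is a minimal illustration of the idea rather than the pipeline of [27]: the function name is hypothetical, and removing the emoticon from the text and discarding tweets that contain both emoticons are design assumptions made for this example.

```python
POSITIVE_EMOTICON = ":)"
NEGATIVE_EMOTICON = ":("


def noisy_label(tweet):
    """Return (cleaned text, label) or None if the tweet is unusable.

    A tweet is unusable when it contains no emoticon or contains both
    a positive and a negative one.
    """
    has_positive = POSITIVE_EMOTICON in tweet
    has_negative = NEGATIVE_EMOTICON in tweet
    if has_positive == has_negative:
        return None
    label = "positive" if has_positive else "negative"
    # Remove the emoticon so a classifier cannot simply memorize it.
    text = tweet.replace(POSITIVE_EMOTICON, "").replace(NEGATIVE_EMOTICON, "")
    return " ".join(text.split()), label


tweets = ["Great day :)", "missed the bus :(", "no emoticon here"]
training_data = [pair for pair in map(noisy_label, tweets) if pair is not None]
```

The labels are called noisy because an emoticon does not always match the sentiment of the text, for example in sarcastic tweets.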
Automatic emotion recognition based on utterance-level prosodic
features may play an important role within speaker-independent
emotion recognition [30]. The recognition of emotions based on the
voice has been studied for decades [31, 32, 33, 34]. The paper in [35]
focused on mono-modal systems with speech as the only input channel.
Another goal is to artificially influence mental and emotional states
in order to improve individual performance in stress-related
occupations and to prevent mental disorders [36]. Recent research has
shown
that under certain circumstances multimodal emotion recognition is
possible even in real time [37].
Sound signals (including human speech) are one of the main
media of communication [38] and can be processed to recognize
the speaker or even the emotion. Several physical features and
processing steps are applied for indexing speech, such as spectrum
irregularity, wide- and narrow-band spectrograms, speech signal
filtering and processing, enhancement and manipulation of specific
frequency regions, and segmentation and labeling of words, syllables
and individual phonemes [37]. Moreover, Mel-Frequency Cepstral
Coefficients (MFCC) are widely used in speech classification
experiments [39]. To reduce the spectral leakage effect, a Hamming
window is applied to each speech frame, which improves the accuracy
of the frequency analysis of human speech [38].
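Windowing a speech frame with a Hamming window can be sketched as below. The window uses the standard coefficients 0.54 and 0.46; the function names and the frame contents are assumptions made for this illustration, and the sketch covers only the windowing step, not a full MFCC pipeline.

```python
import math


def hamming_window(n_samples):
    """Return the symmetric Hamming window of length `n_samples`."""
    if n_samples == 1:
        return [1.0]
    return [
        0.54 - 0.46 * math.cos(2.0 * math.pi * n / (n_samples - 1))
        for n in range(n_samples)
    ]


def apply_window(frame):
    """Multiply each sample of a frame by the matching window coefficient."""
    window = hamming_window(len(frame))
    return [sample * weight for sample, weight in zip(frame, window)]


# Hypothetical frame of constant amplitude: the window tapers its edges.
windowed = apply_window([1.0] * 5)
```

Tapering the frame edges toward a small value removes the abrupt discontinuities at the frame boundaries, which are the source of spectral leakage when the frame is transformed to the frequency domain.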
The MPEG-7 Audio standard contains descriptors and description
schemes that can be divided into two classes: generic low-level tools
and application-specific tools [40]. For classification, Artificial
Neural Networks (ANN), k-Nearest Neighbor (k-NN), Support Vector
Machines (SVM), decision trees, probabilistic models such as the
Gaussian mixture model (GMM) or stochastic models such as the Hidden
Markov Model (HMM) can be applied [36].
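Of the classifiers listed above, k-NN is the simplest to illustrate. The sketch below is a minimal example under stated assumptions: the two-dimensional feature vectors, the emotion labels and the function name are hypothetical and do not come from [36].

```python
import math
from collections import Counter


def knn_predict(training_set, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training samples, using Euclidean distance.

    `training_set` is a list of (feature_vector, label) pairs.
    """
    neighbors = sorted(
        training_set, key=lambda item: math.dist(item[0], query)
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical acoustic feature vectors (e.g. normalized pitch and energy).
training_set = [
    ((0.0, 0.0), "calm"),
    ((0.1, 0.2), "calm"),
    ((1.0, 1.0), "angry"),
    ((0.9, 1.1), "angry"),
]
prediction = knn_predict(training_set, (0.05, 0.1), k=3)
```

In practice the feature vectors would be descriptors such as MFCCs, and k would be chosen by cross-validation.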
Emotion analysis of speech is possible; however, it depends
heavily on the language. A study by Chaspari et al. showed that
emotion classification in Greek speech achieved an accuracy of
up to 75.15% [41]. A similar study by Arruti et al. reported a mean
emotion recognition rate of 80.05% in Basque and of
74.82% in Spanish [42].
Nonverbal behavior constitutes a useful means of communication
in addition to spoken language. The work in [43] identifies at least six
characteristics from posed facial actions that enable emotion
recognition: morphology, symmetry, duration, speed of onset,
coordination of apexes and ballistic trajectory. They are common to
all humans, which supports Darwin’s evolutionary thesis. Therefore,
emotion recognition tools based on facial video are universal.
Automatic detection of emotions from facial expressions is not
simple, and their interpretation is largely context-driven. To reduce
the complexity of automatic affective inference, measurement and
interpretation of facial expressions, Ekman and Friesen developed in
1978 a system for objectively measuring facial movement: the
Facial Action Coding System (FACS) [45]. FACS, based on a system
originally developed by the Swedish anatomist Hjortsjö [46],
became the standard for identifying any movement of the face. Later,
Ekman and Sejnowski also studied computer-based facial
measurement [47].
Automatic emotion recognition based on physiological signals is
a key topic for many advanced applications (safe driving, security,
mHealth, etc.). The main physiological signals analyzed for
emotion detection and classification are:
• electromyogram (EMG) - recording of the electrical
activity produced by skeletal muscles,
• galvanic skin response (GSR) - reflecting skin resistance,
which varies with the state of sweat glands in the skin controlled by
the sympathetic nervous system, where conductance is an indication
of psychological or physiological state,
• respiratory volume (RV) - referring to the volume of air
associated with different phases of the respiratory cycle,
• skin temperature (SKT) - referring to the fluctuations of
normal human body temperature,
• blood volume pulse (BVP) - used to measure the heart rate,
• heart rate (HR),
• electrooculogram (EOG) - measuring the corneo-retinal
standing potential between the front and the back of the human eye,
• photoplethysmography (PPG) - measuring blood volume
pulse (BVP), which is the phasic change in blood volume with each
heartbeat, etc.
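As an illustration of how one of the signals listed above can be turned into a feature, the heart rate can be estimated from the peaks of a BVP signal. This is a minimal sketch under strong assumptions: a clean signal, a fixed amplitude threshold and a known sampling rate. The function names and the synthetic signal are hypothetical, and real recordings would require filtering and more robust peak detection.

```python
def detect_peaks(signal, threshold):
    """Return the indices of local maxima that exceed `threshold`."""
    return [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] > threshold
        and signal[i] > signal[i - 1]
        and signal[i] >= signal[i + 1]
    ]


def heart_rate_bpm(signal, sampling_rate_hz, threshold):
    """Estimate the heart rate from the mean interval between BVP peaks.

    Returns None when fewer than two peaks are found.
    """
    peaks = detect_peaks(signal, threshold)
    if len(peaks) < 2:
        return None
    intervals = [
        (later - earlier) / sampling_rate_hz
        for earlier, later in zip(peaks, peaks[1:])
    ]
    mean_interval = sum(intervals) / len(intervals)
    return 60.0 / mean_interval


# Synthetic signal sampled at 10 Hz with one pulse per second.
bvp = [1.0 if i % 10 == 5 else 0.0 for i in range(50)]
rate = heart_rate_bpm(bvp, sampling_rate_hz=10.0, threshold=0.5)
```

The same inter-beat intervals can also be used to derive heart rate variability, which is a common input feature for emotion classifiers.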
The recognition of emotions based on physiological signals
covers different aspects: emotional models, methods for generating
emotions, common emotional data sets, characteristics used and
choices of classifiers. The whole framework of emotion recognition
based on physiological signals has recently been described in [55].