Seeing helps our hearing: How the visual system plays a role in speech perception

Dublin Core

Title

Seeing helps our hearing: How the visual system plays a role in speech perception

Creator

Brandon O’Hanlon

Date

2021

Description

Difficult listening conditions can reduce our ability to discriminate speech successfully. In these conditions, the visual system assists speech perception through lipreading. Stimulus onset asynchrony (SOA) is used to investigate the interaction between the two senses in speech perception. Because the effects are strongly stimulus-dependent, estimates of how far one stream can be desynchronised from the other differ drastically from account to account. Previous research has not considered viseme categories to ensure that the selected speech phonemes are visually distinct. This study aimed to create and validate a set of audiovisual stimuli that accounts for these variables when examining speech-in-noise, and to determine the SOA integration period for these stimuli. 27 online participants were presented with either audio-only stimuli of a speaker or audiovisual stimuli that also showed the speaker's lip and mouth area as the speech was spoken. The speech was either clear or in noise, and either had no SOA or had SOA introduced at one of five levels (200 ms, 216.6 ms, 233.3 ms, 250 ms, 266.6 ms). Results indicate that, whilst the effect of visual information assisting with speech-in-noise is apparent, it is weaker than reported in previous literature. Whilst response times imply that 250 ms marks the integration window period for our stimuli, no significant changes in accuracy corroborate this finding. In all, the study was successful in creating a more valid set of stimuli for testing. As power sufficiency was not met, further testing would be required to confirm the findings.

Subject

Linear mixed-effects modelling

Source

connection, allowing for a direct, uninterrupted video feed at 1920x1080 resolution and 60 frames per second. The camera was mounted on a stable tripod to minimise camera movement during recording. DroidCam X software was used to stream the video in real time with little compression or loss whilst still retaining 1080p at 60 fps. OBS software was used for recording, as it allowed the audio from the external microphone and the video from the camera to be encoded together in real time as a single MKV file. This was beneficial, as it removed the potential human error that can occur when manually stitching audio and video files together; we can therefore be confident that no synchronisation anomalies arose between the audio and video streams during encoding. Another benefit of OBS is that it reports how many video frames are dropped when recording and encoding an MKV file, which was important to ensure that the home desktop encoded the video in its entirety, akin to a lab-calibrated desktop. No dropped frames were reported for any of the recorded speech tokens. All stimuli were initially recorded as MKV files to avoid lossy compression in the recording. Software-based x264 CPU encoding was used for the recording, as the home system lacked a GPU encoder (such as NVENC).
After the initial recording, the speech tokens were trimmed to length and converted to MP4 files at a resolution of 1280 x 720 and a frame rate of 60 frames per second. As the study would be completed on participants' own laptops or desktop computers over their own internet connections, we could not ensure that every participant had a 1920 x 1080 screen. Reducing the files to 720p accommodated all likely participant screen resolutions whilst ensuring that every participant viewed the files at the same resolution. For audio-only conditions, the video of the lips was overlaid with a plain black PNG image; this kept the audio-only stimuli in video format rather than exporting them as MP3 files. As the internet connection speed of each participant could not be controlled, the experiment was set to download all stimuli into the browser cache before it began, ensuring that there were no latency differences.
Audacity software (Audacity Team, 2021) was then used to extract the audio from the MKV files so that it could be edited as WAV files in Praat (Boersma & Weenink, 2021) for the creation of speech-shaped noise. First, a sentence of English words, 'His plan meant taking a big risk', was recorded to provide a base for the speech-shaped noise. White noise was then produced using Praat's white-noise generator. The noise was converted to an intensity tier and then to an amplitude tier, which was multiplied with the recorded sentence to create speech-shaped noise. Praat was then used to combine the speech-shaped noise with the speech tokens for the speech-in-noise conditions at a speech-to-noise ratio of -16 dB, using a Praat script developed by McCloy (2021). Finally, Audacity was used again to apply amplitude ramps to the start and end of every audio file for every condition. The audio was then stitched back onto the MP4 files.
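For illustration, the sketch below shows how a noise signal can be scaled against a speech signal to reach the -16 dB speech-to-noise ratio using root-mean-square amplitudes. This is a minimal R sketch with placeholder vectors standing in for the WAV samples; the study itself used the McCloy (2021) Praat script for this step.

# Minimal R sketch: scale noise to a target speech-to-noise ratio (SNR).
# The vectors below are placeholders; in practice the samples would come
# from the WAV files edited in Audacity and Praat.
set.seed(1)
speech <- sin(2 * pi * 220 * seq(0, 1, by = 1 / 44100))  # stand-in "speech"
noise  <- rnorm(length(speech))                          # stand-in noise

rms <- function(x) sqrt(mean(x^2))      # root-mean-square amplitude

target_snr_db <- -16                    # speech sits 16 dB below the noise

# Scale the noise so that 20 * log10(rms(speech) / rms(noise_scaled)) = -16
noise_scaled <- noise * rms(speech) / (rms(noise) * 10^(target_snr_db / 20))
mix <- speech + noise_scaled            # would be rescaled before export to avoid clipping

20 * log10(rms(speech) / rms(noise_scaled))   # check: prints -16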
For the conditions where stimulus onset was asynchronous, Lightworks was again used to displace the audio ahead of the onset of the speech token by an exact number of video frames (12, 13, 14, 15, or 16 frames at 60 frames per second), corresponding to the stimulus onset asynchrony levels of the relevant conditions. The result was 42 stimuli in MP4 format, representing three speech tokens (Ba, Fa, and Ka) for each of the 14 conditions presented to the participant. These were uploaded to a GitHub repository to be accessed by Pavlovia during the experiment.
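The correspondence between these frame offsets and the SOA levels follows directly from the 60 fps frame rate, as the short R sketch below confirms.

# Convert frame offsets to milliseconds at 60 frames per second
frames <- 12:16
soa_ms <- frames / 60 * 1000
round(soa_ms, 1)
# -> 200.0 216.7 233.3 250.0 266.7, i.e. the five SOA levels in milliseconds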
Procedure
Participants were first given a participant code and a link to the online Qualtrics consent and screening forms via email. A copy of the participant information sheet was displayed at the start of the Qualtrics questionnaire to remind participants of the study details and to ensure informed consent was given. Participants were also reminded at this stage to ensure that they were in a quiet room with no background noise, and to load the experiment in the Microsoft Edge, Google Chrome, or Mozilla Firefox browser on a laptop or desktop computer. They were explicitly told not to open the experiment in any other browser, such as Safari, nor on a mobile or tablet device, as these were incompatible. Once consent had been given and the participant had met the screening criteria based on their answers, they were automatically redirected to the experiment on Pavlovia. If a participant did not meet the criteria for the study, they were redirected to a message informing them of their ineligibility and were prevented from proceeding to the rest of the experiment. At the start of the experiment, participants were once again reminded of the browser and device limitations and told to use headphones in a quiet room. If a participant had loaded the experiment on an incompatible device or browser, they were instructed to close it and re-open it on a compatible device or browser before beginning.
A volume check then began, in which a constant A tone was played and participants were asked to adjust the volume of their device to a comfortable level and to confirm that the audio was playing correctly at a sufficient volume. In a typical lab setting, a set volume would be used for all participants; however, as the study was completed online on participants' own devices, settling for each participant's preferred listening volume was preferable. Once the volume was set, the participant pressed the spacebar and the tone stopped. Participants were then given a brief explanation of the task. They were informed that a video would play showing either no visual information or visual information of lips moving, while speech was played. Participants were told to listen carefully to the speech sound spoken and, after hearing it, to press one of three keys on their keyboard corresponding to the three available speech tokens. They were reminded before and after each trial to press 'z' if they heard "Ba", 'x' for "Fa", or 'c' for "Ka". Participants were told to answer as quickly as possible and, if unsure, to make a guess.
To begin, participants were given 6 practice trials to attempt the task before data were collected. These used the clear, 0 ms, audiovisual condition stimuli, with 2 trials for each of the 3 speech tokens (Ba, Fa, and Ka). A white crosshair was displayed for 1000 ms before each trial began, to draw attention to the centre of the screen where the videos would be displayed. Stimuli were shown for 2500 ms, after which the response screen was displayed. On this screen, participants were reminded of the keys to press for each of the three speech sounds. Only these three keys could be pressed, and key presses made whilst the stimuli were still playing were not registered. The first key pressed after the stimulus had played was recorded, after which the participant was taken to a relay screen instructing them to press the spacebar to continue. Upon pressing the spacebar, the white crosshair returned and the next trial began.
After completing the practice, the participant was reminded of the task details once more before the experiment proper began. A total of 546 trials (not including the practice trials) were completed. The order of trials and conditions was fully randomised to mitigate any potential order bias. Every 42 trials, a break screen appeared, telling the participant to take a short break before continuing with a press of the spacebar. If the participant did not wish to take a break, they could continue immediately with a spacebar press. There were 12 breaks in total. After each break, participants were asked a basic mathematics question, for example 'What is 3 + 2?', and could only proceed to the next block of trials if they responded with the correct answer. This was put in place to ensure that participants were continuing to pay attention to the experiment. Upon reaching the end of the final trial, participants were shown an ending screen informing them that the experiment had ended and asking them to email the primary researcher for debriefing information. Upon completing the study, participants could close the browser tab or window, and all data remained recorded on the Pavlovia system.
If a participant closed the browser tab or window during the experiment, partial data were recorded up to the last trial they responded to. If this happened by mistake, participants could open the experiment again and restart; however, progress was not saved, and the participant had to start the experiment again from scratch. Using the same participant code did not overwrite the participant's previous data, but instead created a new participant dataset. In such cases, full datasets were used in preference to partial datasets, unless no full dataset was recorded for a participant.
Analysis
Descriptive statistics were first gathered from each condition for both the accuracy ratings and the reaction times. The assumptions of linear and generalised linear mixed-effects models were then tested: residual plots were checked for linearity, quantile-quantile plots for normality, multicollinearity between stimuli type, speech type, and stimulus onset asynchrony level was assessed using variance inflation factors, and the assumption of homoscedasticity was verified.
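A minimal R sketch of these assumption checks is given below. It uses lme4's bundled sleepstudy data purely for illustration, since the study's own data frame is not reproduced in this record; the variance inflation factors are computed by hand from the fixed-effects design matrix.

library(lme4)

# Illustrative model on lme4's bundled sleepstudy data (not the study's data)
m <- lmer(Reaction ~ Days + I(Days^2) + (1 | Subject), data = sleepstudy)

# Residuals against fitted values: linearity and homoscedasticity
plot(fitted(m), resid(m), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Quantile-quantile plot: normality of the residuals
qqnorm(resid(m))
qqline(resid(m))

# Variance inflation factors from the fixed-effects design matrix
X <- model.matrix(m)[, -1, drop = FALSE]   # drop the intercept column
diag(solve(cor(X)))                        # VIF for each fixed effect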
Using the lmerTest (Kuznetsova et al., 2020) and lme4 packages, linear mixed-effects regression (LMER) analyses were conducted on the response times and generalised linear mixed-effects regression (GLMER) analyses on the accuracy scores. LMERs were chosen over repeated-measures general linear models such as ANOVA because they account for random effects that may be present across all 546 trials on a participant-by-participant basis. As accuracy is inherently bounded (each response is either accurate or inaccurate), it can be treated as categorical; GLMERs were therefore used for the accuracy analyses to meet the assumptions of mixed-effects models with categorical dependent variables. For the LMER analyses, there were two models. Model 1 used response time as the dependent variable, with stimuli type and speech type as fixed effects, and included the interaction between stimuli type and speech type. Model 2 used response time as the dependent variable, with speech type and stimulus onset asynchrony timing as fixed effects, and included the interaction between speech type and stimulus onset asynchrony timing.
The GLMER analyses also comprised two models. Model 1 used accuracy as the dependent variable, with stimuli type and speech type as fixed effects, including their interaction. Model 2 used accuracy as the dependent variable, with speech type and stimulus onset asynchrony timing as fixed effects, again including their interaction. For all four models, the speech token (Ba, Fa, or Ka), participant age, and participant ID were included as random effects.
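For reference, the four models described above could be specified in lme4/lmerTest syntax roughly as sketched below. This is a minimal illustration on simulated placeholder data: all variable names are assumptions rather than the names used in the archived Excel file, and the toy design is fully crossed rather than matching the study's 14 conditions, so the fits exist only to show the syntax and will typically be singular.

library(lme4)
library(lmerTest)

# Simulated, fully crossed toy data frame standing in for the trial-level data.
# Assumed variable names: rt (response time, ms), accuracy (0/1), stimuli_type
# (audio-only vs audiovisual), speech_type (clear vs in-noise), soa (SOA level),
# token (Ba/Fa/Ka), age (participant age), participant (participant ID).
set.seed(42)
dat <- expand.grid(
  participant  = factor(1:27),
  token        = factor(c("Ba", "Fa", "Ka")),
  stimuli_type = factor(c("audio-only", "audiovisual")),
  speech_type  = factor(c("clear", "in-noise")),
  soa          = factor(c("0", "200", "216.6", "233.3", "250", "266.6"))
)
ages         <- sample(18:60, 27, replace = TRUE)
dat$age      <- factor(ages[as.integer(dat$participant)])
dat$rt       <- rnorm(nrow(dat), mean = 700, sd = 100)   # placeholder response times
dat$accuracy <- rbinom(nrow(dat), 1, 0.8)                # placeholder 0/1 accuracy

# LMER Model 1: response time ~ stimuli type x speech type
rt_m1 <- lmer(rt ~ stimuli_type * speech_type +
              (1 | participant) + (1 | token) + (1 | age), data = dat)

# LMER Model 2: response time ~ speech type x SOA
rt_m2 <- lmer(rt ~ speech_type * soa +
              (1 | participant) + (1 | token) + (1 | age), data = dat)

# GLMER Model 1: accuracy ~ stimuli type x speech type (binomial family)
acc_m1 <- glmer(accuracy ~ stimuli_type * speech_type +
                (1 | participant) + (1 | token) + (1 | age),
                data = dat, family = binomial)

# GLMER Model 2: accuracy ~ speech type x SOA (binomial family)
acc_m2 <- glmer(accuracy ~ speech_type * soa +
                (1 | participant) + (1 | token) + (1 | age),
                data = dat, family = binomial)

summary(rt_m1)   # lmerTest adds Satterthwaite p-values to the LMER summaries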

Publisher

Lancaster University

Format

Excel File

Identifier

O’Hanlon2021

Contributor

Stephanos Mosfiliotis

Rights

Open (unless stated otherwise)

Relation

None (unless stated otherwise)

Language

English

Type

Data

Coverage

LA1 4YF

LUSTRE

Supervisor

Dr Helen E. Nuttall

Project Level

MSC

Topic

Cognitive, Perception

Sample Size

48 participants (11 male, 14 female, 2 non-binary)

Statistical Analysis Type

Quantitative

Files

Collection

Citation

Brandon O’Hanlon, “Seeing helps our hearing: How the visual system plays a role in speech perception,” LUSTRE, accessed May 3, 2024, https://www.johnntowse.com/LUSTRE/items/show/122.