Perhaps you’ve heard the admonishment “look at me when I’m speaking to you.” But what if, while wearing headphones, you could hear someone more clearly just by looking at them?
That’s roughly what happens with a new artificial intelligence system developed by University of Washington researchers, in which a person wearing noise-canceling headphones can “enroll” a single speaker just by looking at them for a few seconds. The system then cancels out all other noise in the environment and plays only the enrolled speaker’s voice, even if the listener moves around and no longer faces the speaker.
Called “Target Speech Hearing,” the effort comes from the same UW team that previously developed a “semantic hearing” system for noise-canceling headphones, which let listeners decide which sounds to filter out of their environment and which to let in. Birds chirping? Yes. Kid screaming? Nope.
The new system relies on off-the-shelf headphones fitted with microphones. A person wearing the headphones taps a button while directing their head toward someone who is talking. Sound waves from the speaker’s voice reach the microphones on both sides of the headset, and the headphones send that signal to an on-board embedded computer, where machine-learning software learns the desired speaker’s vocal patterns.
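For readers who want a concrete picture of that enrollment step, here is a minimal, hedged sketch in Python. It is not the UW team’s released code: the real system learns the speaker’s voice with a neural network on the embedded computer, whereas the function name, parameters, and the simple averaged-spectrum “voiceprint” below are stand-ins chosen purely for illustration.

```python
# Illustrative sketch of the enrollment step (NOT the UW implementation).
# A crude averaged spectrum stands in for the learned speaker signature.
import numpy as np

SAMPLE_RATE = 16_000      # assumed microphone sample rate
FRAME_SIZE = 512          # samples per analysis frame
ENROLL_SECONDS = 3        # "looking at them for a few seconds"

def enroll_speaker(left_mic: np.ndarray, right_mic: np.ndarray) -> np.ndarray:
    """Return a toy voiceprint from a few seconds of binaural audio.

    Because the wearer is facing the speaker, the target voice reaches both
    ears at roughly the same time, so summing the channels reinforces it
    relative to off-axis sounds.
    """
    n = SAMPLE_RATE * ENROLL_SECONDS
    mono = (left_mic[:n] + right_mic[:n]) / 2.0

    # Average magnitude spectrum over short frames -> crude spectral signature.
    frames = mono[: len(mono) // FRAME_SIZE * FRAME_SIZE].reshape(-1, FRAME_SIZE)
    spectra = np.abs(np.fft.rfft(frames * np.hanning(FRAME_SIZE), axis=1))
    voiceprint = spectra.mean(axis=0)
    return voiceprint / (np.linalg.norm(voiceprint) + 1e-9)
```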
The system latches onto that speaker’s voice and continues to play it back to the listener in real time, even as the pair moves around. Its ability to focus on the enrolled voice improves as the speaker keeps talking, giving the system more training data, according to the UW researchers.
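To illustrate the real-time playback and the gradual improvement described here, the sketch above can be extended with a toy per-frame loop: frames whose spectrum resembles the enrolled voiceprint are passed through and folded back into the voiceprint, while everything else is muted. The threshold and update rate are invented parameters, and the actual system performs neural source separation rather than this simple spectral matching.

```python
# Toy real-time loop (an analogy, not the UW team's algorithm).
import numpy as np

MATCH_THRESHOLD = 0.6     # assumed cosine-similarity cutoff
UPDATE_RATE = 0.05        # how quickly the voiceprint adapts

def process_frame(frame: np.ndarray, voiceprint: np.ndarray,
                  frame_size: int = 512) -> tuple[np.ndarray, np.ndarray]:
    """Return (output audio frame, possibly updated voiceprint)."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(frame_size)))
    spectrum /= (np.linalg.norm(spectrum) + 1e-9)
    similarity = float(spectrum @ voiceprint)

    if similarity >= MATCH_THRESHOLD:
        # Frame looks like the enrolled speaker: pass it through and let the
        # voiceprint drift toward it (the "more training data" effect).
        voiceprint = (1 - UPDATE_RATE) * voiceprint + UPDATE_RATE * spectrum
        voiceprint /= (np.linalg.norm(voiceprint) + 1e-9)
        return frame, voiceprint

    # Otherwise mute the frame, standing in for cancelling everything else.
    return np.zeros_like(frame), voiceprint
```

In this toy version the voiceprint keeps adapting only on frames it already accepts, which loosely mirrors the reported behavior that focus on the enrolled voice sharpens the longer that speaker talks.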
Some noise-canceling headphones already on the market, such as Apple’s AirPods Pro, can automatically adjust sound levels during a conversation. The UW prototype goes a step further by letting the user control whom to listen to and when.
Picture wearing the headphones in a crowded restaurant or cafeteria, where background noise makes it difficult to clearly hear the person sitting across from you. A push of a button and a glance at the speaker changes that.
Currently the system can enroll only one speaker at a time, and it can do so only when no other loud voice is coming from the same direction as the target speaker’s. A user can run another enrollment on the speaker to improve clarity.
The team presented its findings May 14 in Honolulu at the ACM CHI Conference on Human Factors in Computing Systems. The code for the proof-of-concept device is available for others to build on. The system is not commercially available.
Watch the system in action: