VoiceShield: Teaching Computers to Distinguish Real Data From Fake

Zachary Trefler

Age 18 | Waterloo, Ontario

Canadian Acoustical Association Award 2018 | Challenge Award 2018 | Canada-Wide Science Fair Excellence Award: Senior Bronze Medallist 


Voice and image recognition systems powered by machine learning models are coming into widespread use for many applications, including for securing critical data. However, adversarial attacks using faked data may undermine the security of these systems.

Machine learning is a powerful computational problem-solving approach that tells a computer what the solution to a problem should look like; through many rounds of training, the computer develops a method to reach a solution. This contrasts with traditional software, in which humans write exact instructions specifying how to solve problems. 

One important type of machine learning framework, the Artificial Neural Network, is a mathematical structure based loosely on the networks of neurons in the brain (Kleene, 1951). A neural network ‘learns’ as real neurons might: during training, helpful connections receive positive feedback and unhelpful ones receive negative feedback, encouraging computational structures to self-form. By the end of training, these structures, otherwise known as models, reliably produce correct solutions.

A person’s voice is unique to the individual, and is very difficult to imitate; thus, a ‘voice print’ is considered effective for data security. However, although humans can easily tell two voices apart, a human voice contains so many parameters that programming a voice classifier by hand is impossible. Machine learning has enabled the development of accurate voice recognition-based security, known technically as ‘speaker verification’ (SV). These SV systems are now widely deployed in important applications such as in banking systems and on smartphones.

But are these systems truly secure?

One important method used to test modern machine learning systems is to strategically expose a model to data that has been manipulated in an effort to break it. Successfully withstanding these attacks is evidence of a model’s robustness.

The Generative Adversarial Network (GAN) (Goodfellow et al., 2014) is a machine learning approach, introduced just four years ago, that creates two paired neural networks: one (the ‘generator’) that generates realistic images, and the other (the ‘discriminator’) that categorizes them as real or generated. During training, the paired networks are pitted against each other, so that as the discriminator becomes better able to distinguish real images from generated ones, the generator creates more realistic images. The generator gets very good at ‘fooling’ the discriminator, so the discriminator effectively learns to handle unexpected data. Because GANs generate very good fakes, I theorized that they could be used to undermine existing speaker verification systems by generating very realistic, but fake, voice data. This project assessed the vulnerability of existing speaker verification technologies to generative adversarial attacks, and proposed a solution to mitigate this vulnerability.
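The contest between generator and discriminator is usually written as a minimax objective. For reference, the generic form from Goodfellow et al. (2014) is given below, where x is a real sample, z is the generator's random input, and D(·) is the discriminator's estimated probability that its input is real. This is the standard GAN formulation, not necessarily the exact objective used in VoiceShield, whose losses are not specified in this summary.

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right]
  + \mathbb{E}_{z \sim p_z(z)}\left[\log\left(1 - D(G(z))\right)\right]
```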

This project investigated speaker verification for three reasons (Wang, Li, Tang, & Zheng, 2017):

1. It is of high security importance;

2. It is complex enough to be vulnerable to adversarial approaches; and

3. Audio attacks could be easily deployed without detection - hackers could use the GAN approach to easily generate fake voice prints and compromise secure systems.

The solution developed in this project, VoiceShield, is a GAN-style system of linked neural networks that trains a speaker verification model while showing it faked data, so that it learns to recognize such data. VoiceShield thus provides a novel solution to the problem of adversarial attacks that use fake data to undermine critical SV security systems.


The goal of this research was to determine whether existing SV systems are vulnerable to faked voice prints, and, if so, to address this vulnerability. To do this, VoiceShield proceeded in three phases:

1. Implement a traditional discriminator model to perform speaker verification, which performs successfully on an ordinary dataset.

2. Develop a generator model to disguise one speaker as another, which clearly defeats the traditional discriminator, while preserving (i.e. not distorting) the spoken words themselves.

3. Create an adversarially re-trained version of the discriminator, which, having been exposed to the generator during training, classifies both disguised and ordinary speakers correctly.

With successful implementation of Phase 3, the improved discriminator would be more robust, and therefore more secure, than the original.

First, a database of spoken words (Melin, 1999) - comprising oscillogram data transformed into spectral features as per SV standard practice - was used to train a neural network discriminator D, which used convolutional and densely-connected ELU (Clevert, Unterthiner, & Hochreiter, 2015) hidden layers to identify a speaker spectrogram as being of ‘Speaker A’ or ‘Speaker B’ (D(spectrogram) → speaker). An 80%/20% train/test split was used, with norms and dropout for regularization. A similar classifier C was also trained, but instead of identifying a speaker, C classified an input as words (C(spectrogram) → text). This classifier was used as a check to ensure that words were not garbled or semantically morphed during the generative step.
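Two of the ingredients named above can be illustrated in a few lines: the ELU activation (Clevert et al., 2015) and an 80%/20% train/test split. The following is a minimal sketch in plain Python. The function names, the `alpha` default, the seed, and the placeholder dataset are assumptions made for illustration, not details taken from the project's code.

```python
import math
import random

def elu(x, alpha=1.0):
    """Exponential Linear Unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle a copy of the samples and hold out test_fraction of them."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical stand-ins for (spectrogram, speaker) pairs.
dataset = [("spectrogram_%d" % i, "A" if i % 2 == 0 else "B") for i in range(10)]
train, test = train_test_split(dataset)
```

Unlike ReLU, ELU returns small negative values for negative inputs rather than zero, which is the property the cited paper argues speeds up learning.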

Next, a generative model G was trained to transform a spectrogram of Speaker A to one of Speaker B, or vice versa (G(spectrogram) → fake spectrogram). This model, conceptually similar to CycleGAN (Zhu, Park, Isola, & Efros, 2017), used convolutional up- and downsampling with residual layers, and was trained using the inverse loss of D on G’s outputs. These outputs were also passed through C, to ensure G was preserving meaning while shifting the speaker, creating a semi-supervised GAN system (Salimans et al., 2016).
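One way to read the generator's objective is as the sum of two terms: one that is low when D is fooled, and one that is low when C still recognizes the original words. The sketch below makes that reading concrete. It assumes cross-entropy losses, interprets "inverse loss of D" as cross-entropy against the target (disguised) speaker, and introduces a hypothetical `content_weight` parameter; none of these details are confirmed by the text.

```python
import math

def cross_entropy(prob_of_wanted_class, eps=1e-12):
    """Negative log-likelihood of the class we want the network to predict."""
    return -math.log(max(prob_of_wanted_class, eps))

def generator_loss(d_prob_target_speaker, c_prob_original_words, content_weight=1.0):
    """Illustrative combined objective for G (an assumption, not the author's exact loss).

    d_prob_target_speaker: probability D assigns to the target speaker on G's output.
    c_prob_original_words: probability C assigns to the original words on G's output.
    """
    fooling_term = cross_entropy(d_prob_target_speaker)  # low when D is fooled
    content_term = cross_entropy(c_prob_original_words)  # low when the words survive
    return fooling_term + content_weight * content_term

good = generator_loss(0.95, 0.95)     # fools D and keeps the words
garbled = generator_loss(0.95, 0.10)  # fools D but destroys the words
exposed = generator_loss(0.10, 0.95)  # keeps the words but fails to fool D
```

Under this reading, a generator that fools D by garbling the speech is penalized as heavily as one that fails to fool D at all, which is the role the text assigns to C.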

Figure 1. Real spectrogram (left) and faked transfer to another voice (right), which fools D but not D*.


Finally, G and D were re-trained simultaneously against each other, by backpropagating D’s loss on G’s output to improve D. As one network improved, the other got worse - known as a zero-sum framework. This resulted in the unchanged network G and a modified discriminator D*. This process formed the basis of the VoiceShield system, whose goal was to produce a robust discriminator (D*) from a vulnerable one (D).
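The re-training step can be illustrated with a deliberately tiny stand-in. In the sketch below, a logistic-regression classifier on two made-up features replaces the convolutional discriminator, and a fixed, hypothetical `disguise` function replaces G: it mirrors the first feature into the other speaker's range and leaves the second untouched. The data, features, and hyperparameters are all assumptions chosen for illustration. Only the overall recipe (train D on clean data, then continue training on G's outputs labeled with the true speaker to obtain D*) follows the text.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict(weights, x):
    """Probability that feature vector x belongs to Speaker B (label 1)."""
    w0, w1, b = weights
    return sigmoid(w0 * x[0] + w1 * x[1] + b)

def train(weights, data, epochs=2000, lr=0.5):
    """Full-batch gradient descent on binary cross-entropy; returns new weights."""
    w0, w1, b = weights
    n = len(data)
    for _ in range(epochs):
        g0 = g1 = gb = 0.0
        for x, y in data:
            err = sigmoid(w0 * x[0] + w1 * x[1] + b) - y
            g0 += err * x[0]
            g1 += err * x[1]
            gb += err
        w0, w1, b = w0 - lr * g0 / n, w1 - lr * g1 / n, b - lr * gb / n
    return (w0, w1, b)

def accuracy(weights, data):
    """Fraction of samples whose predicted speaker matches the true label."""
    hits = sum((predict(weights, x) > 0.5) == (y == 1) for x, y in data)
    return hits / len(data)

def disguise(x):
    """Hypothetical fixed generator G: mirrors feature 0, leaves feature 1 alone."""
    return (-x[0], x[1])

# Toy data: Speaker A (label 0) and Speaker B (label 1) on a small grid.
speaker_a = [((-1.0 + d, -0.5 + e), 0) for d in (-0.2, 0.0, 0.2) for e in (-0.1, 0.0, 0.1)]
speaker_b = [((1.0 + d, 0.5 + e), 1) for d in (-0.2, 0.0, 0.2) for e in (-0.1, 0.0, 0.1)]
clean = speaker_a + speaker_b
fakes = [(disguise(x), y) for x, y in speaker_a]  # true label is still Speaker A

d_weights = train((0.0, 0.0, 0.0), clean)         # D: sees clean data only
d_star_weights = train(d_weights, clean + fakes)  # D*: re-trained on G's output too

for name, w in (("D", d_weights), ("D*", d_star_weights)):
    print(name, "clean:", accuracy(w, clean), "fakes:", accuracy(w, fakes))
```

In this toy setting D leans on the feature that the generator can alter, whereas D* is pushed towards the feature the generator leaves intact. That mirrors the intended effect of adversarial re-training, though the real spectrogram features are far less cleanly separable.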

Figure 2. Schematic representation of VoiceShield.



The networks were each trained for 100 epochs, and then 100 more with a decreasing learning rate. Once trained, both D and D* achieved final losses of approximately 0.1 (dimensionless) on the standard voice datasets. However, when D was exposed to adversarially manipulated data, it produced much higher losses, near 0.5, while D* kept losses down near 0.1. Essentially, D was fooled by the fake data, while D* was not.
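The two-stage training schedule described above can be sketched as a simple function of the epoch number. The text does not say how the learning rate decreased or what its starting value was, so the linear decay and the `base_lr` value below are assumptions chosen only to make the idea concrete.

```python
def learning_rate(epoch, base_lr=1e-3, constant_epochs=100, decay_epochs=100):
    """Constant rate for the first phase, then linear decay towards zero (assumed shape)."""
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs

# One rate per epoch for the full 200-epoch run.
schedule = [learning_rate(epoch) for epoch in range(200)]
```

Holding the rate constant first and shrinking it afterwards lets the later epochs make smaller, more careful weight updates.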

As predicted, the traditional discriminator D failed to perform adequately given adversarially-generated fake data. However, when the VoiceShield method was used, the resultant discriminator D* proved to be much more resistant to adversarial attacks.


The results reported above conclusively demonstrate the effectiveness of VoiceShield in securing speaker verification models, even in the face of machine-learning based adversarial attacks via faked voice prints. VoiceShield thus provides a novel solution to the problem of adversarial attacks using fake data that can undermine critical machine learning-based voice recognition security systems. With the fake data generation capabilities of machine learning systems increasing rapidly, real-life attacks like those described here will become increasingly sophisticated. This line of research is thus both interesting from a computational point of view, and of critical security importance for systems in use today.

It is recommended that VoiceShield be used to re-train existing speaker verification models to secure them against adversarial audio attacks. Moving forward, VoiceShield is fully scalable, and will become more effective with the addition of larger datasets. The approach could also be extended to other ML systems, such as image and facial recognition and malware detection; given the importance of these systems in modern computing, it is recommended that the efficacy of the VoiceShield approach be investigated on these systems in the near future.


Clevert, D., Unterthiner, T., & Hochreiter, S. (2015). Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). CoRR. arXiv:1511.07289

Goodfellow, I. J., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative Adversarial Networks. arXiv. arXiv:1406.2661

Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8), 2554-2558. doi:10.1073/pnas.79.8.2554

Kleene, S. C. (1951). Representation of Events in Nerve Nets and Finite Automata. Automata Studies, AM(34), 1-98. doi:10.1515/9781400882618-002

Li, L., Lin, Y., Zhang, Z., & Wang, D. (2015). Improved Deep Speaker Feature Learning for Text-Dependent Speaker Recognition. CoRR. arXiv:1506.08349

Melin, H. (1999). Databases for speaker recognition: Activities in COST250 working group 2. COST 250-Speaker Recognition in Telephony, Final Report 1999.

Pannous. (2016). spoken_numbers_wav [Data file]. Retrieved from http://pannous.net/files/

Reynolds, D. A. (2002). An Overview of Automatic Speaker Recognition Technology. 2002 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 4. IEEE.

Sahidullah, M., & Kinnunen, T. (2016). Local spectral variability features for speaker verification. Digital Signal Processing, 50. doi:10.1016/j.dsp.2015.10.011

Salimans, T., Goodfellow, I. J., Zaremba, W., Cheung, V., Radford, A., & Chen, X. (2016). Improved Techniques for Training GANs. arXiv. arXiv:1606.03498

Samuel, A. L. (1959). Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development, 3(3), 210-229. doi:10.1147/rd.33.0210

Wang, D., Li, L., Tang, Z., & Zheng, T. F. (2017). Deep Speaker Verification: Do We Need End to End? CoRR. arXiv:1706.07859

Zhang, H., Xu, T., Li, H., Zhang, S., Huang, X., Wang, X., & Metaxas, D. N. (2016). StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks. CoRR. arXiv:1612.03242

Zhang, H., Xu, T., Li, H., Zhang, S., Wang, X., Huang, X., & Metaxas, D. N. (2017). StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks. CoRR. arXiv:1710.10916

Zhu, J., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. CoRR. arXiv:1703.10593



Hi, I’m Zach Trefler. I’m an 18-year-old data science researcher, currently studying computer science and physics jointly at the University of Toronto. I grew up in Waterloo, Ontario, where the combined influences of the University of Waterloo and the Perimeter Institute helped guide me towards my interests in physics and CS, especially at their intersection. I started reading CS textbooks at age 9, around the time I participated in my first science fair. I presented two projects at the Canada-Wide Science Fair: a machine learning algorithm to predict harmful algal blooms, which won a silver medal; and a means of breaking complex speaker verification models with adversarial attacks, and a strategy to make similar models resistant to the same attacks. This last project won best in the “information” category, and I was invited to present it at the Prime Minister’s Science Fair in Ottawa.