ABSTRACT

We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog-session-aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of acoustic models are first combined at the senone/frame level, followed by word-level voting via confusion networks. We also added a confusion network rescoring step after system combination. The resulting system yields a 5.1% word error rate on the 2000 Switchboard evaluation set.

1. INTRODUCTION

We have witnessed steady progress in the improvement of automatic speech recognition (ASR) systems for conversational speech, a genre that was once considered among the hardest in the speech recognition community due to its unconstrained nature and intrinsic variability [1]. The combination of deep networks and efficient training methods with older neural modeling concepts [2, 3, 4, 5, 6, 7, 8] has produced steady advances in both acoustic modeling [9, 10, 11, 12, 13, 14, 15] and language modeling [16, 17, 18, 19]. These systems typically use deep convolutional neural network (CNN) architectures for acoustic modeling, and multi-layered recurrent networks with gated memory (long short-term memory, LSTM [8]) models for both acoustic and language modeling, driving the word error rate on the benchmark Switchboard corpus [20] down from its mid-2000s plateau of around 15% to well below 10%. We can attribute this progress to the neural models' ability to learn regularities over a wide acoustic context in both the time and frequency dimensions, and, in the case of language models, to condition on unlimited histories and learn representations of functional word similarity [21, 22].

Given these developments, we carried out an experiment last year to measure the accuracy of a state-of-the-art conversational speech recognition system against that of professional transcribers. We sought to answer the question of whether machines had effectively caught up with humans on this originally very challenging speech recognition task. To measure human error on this task, we submitted the Switchboard evaluation data to our standard conversational speech transcription vendor pipeline (with the vendor blind to the experiment), post-processed the output to remove text normalization discrepancies, and then applied the NIST scoring protocol. The resulting human word error rate was 5.9%, not statistically different from the 5.8% error rate achieved by our ASR system [23]. In a follow-up study [24], we found that, qualitatively, the human and machine transcriptions were remarkably similar: the same short function words account for most of the errors, the same speakers tend to be easy or hard to transcribe, and it is difficult for human subjects to tell whether an errorful transcript was produced by a human or by ASR. Meanwhile, another research group carried out its own measurement of human transcription error [25], while multiple groups reported further improvements in ASR performance [25, 26].
The IBM human transcription study employed a more involved transcription process, with more listening passes, a pool of transcribers, and access to the conversational context of each utterance, yielding a human error rate of 5.1%. Together with a prior study by LDC [27], we can conclude that human performance, unsurprisingly, falls within a range depending on the level of effort expended.

In this paper we describe a new iteration in the development of our system, pushing well past the 5.9% benchmark we measured previously. The overall gain comes from a combination of smaller improvements in all components of the recognition system. We added an additional acoustic model architecture, a CNN-BLSTM, to our system. Language modeling was improved with an additional utterance-level LSTM based on characters instead of words, as well as a dialog-session-based LSTM that uses the entire preceding conversation as history. Our system combination approach was refined by combining predictions from multiple acoustic models at both the senone/frame and word levels. Finally, we added an LM rescoring step after confusion network creation, bringing us to an overall error rate of 5.1%, thus surpassing the human accuracy level we had measured previously. The remainder of the paper describes each of these enhancements in turn, followed by overall results.

2. ACOUSTIC MODELS

2.1. Convolutional Neural Nets

We used two types of CNN model architectures: ResNet and LACE (VGG, a third architecture used in our previous system, was dropped). The residual-network (ResNet) architecture [28] is a standard CNN with added highway connections [29], i.e., a linear transform of each layer's input added to the layer's output [29, 30]. We apply batch normalization [31] before computing rectified linear unit (ReLU) activations.

[Fig. 1. LACE network architecture]

The LACE (layer-wise context expansion with attention) model is a modified CNN architecture, first proposed in [32] and depicted in Figure 1. It is a variant of the time-delay neural network (TDNN) [4] in which each higher layer is a weighted sum of nonlinear transformations of a window of lower-layer frames. Lower layers focus on extracting simple local patterns, while higher layers extract complex patterns that cover broader contexts. Since not all frames in a window carry the same importance, a learned attention mask is applied, shown as the "element-wise matrix product" in Figure 1. The LACE model thus differs from the earlier TDNN models [4, 33] in this attention masking, as well as in the ResNet-like linear pass-through connections. As shown in the diagram, the model is composed of four blocks, each with the same architecture. Each block starts with a convolution layer with stride two, which subsamples the input and increases the number of channels. This layer is followed by four ReLU convolution layers with jump links similar to those used in ResNet. As for ResNet, batch normalization [31] is used between layers. A sketch of this block structure is given below.
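To make the block structure concrete, the following is a minimal PyTorch sketch of one LACE block under several assumptions: the kernel sizes, the pairing of jump links every two convolution layers, and the per-channel attention weight (a simplification of the full element-wise attention mask of Figure 1) are illustrative choices on our part, not details taken from [32].

```python
import torch
import torch.nn as nn

class LACEBlock(nn.Module):
    """One of the four identical LACE blocks: a stride-2 convolution that
    subsamples the input and increases the channel count, followed by four
    ReLU convolution layers with ResNet-style jump links and batch
    normalization between layers."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        # Stride-2 convolution: subsamples the input, increases channels.
        self.downsample = nn.Conv2d(in_channels, out_channels, 3,
                                    stride=2, padding=1)
        # Simplified stand-in for the learned element-wise attention mask
        # (here a per-channel weight rather than a full time-frequency mask).
        self.attention = nn.Parameter(torch.ones(out_channels, 1, 1))
        # Four convolution layers with jump links, as in ResNet.
        self.convs = nn.ModuleList(
            nn.Conv2d(out_channels, out_channels, 3, padding=1)
            for _ in range(4))
        # Batch normalization between layers.
        self.bns = nn.ModuleList(nn.BatchNorm2d(out_channels)
                                 for _ in range(5))

    def forward(self, x):
        x = torch.relu(self.bns[0](self.downsample(x))) * self.attention
        for i in (0, 2):  # two jump-linked pairs of ReLU convolution layers
            residual = x
            x = torch.relu(self.bns[i + 1](self.convs[i](x)))
            x = torch.relu(self.bns[i + 2](self.convs[i + 1](x)) + residual)
        return x
```

The full model would stack four such blocks, doubling the channel count at each stride-2 subsampling step; that stacking pattern is likewise our reading of the diagram rather than a specification from the paper.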
2.2. Bidirectional LSTM

For our LSTM-based acoustic models we use a bidirectional architecture (BLSTM) [34] without frame skipping [11]. The core model structure is the LSTM defined in [10]. We found that using networks with more than six layers did not improve the word error rate on the development set, and chose 512 hidden units per direction per layer; this gave a reasonable tradeoff between training time and final model accuracy. BLSTM performance was significantly enhanced using a spatial smoothing technique, first described in [23]. Briefly, a two-dimensional topology is imposed on each layer, and activation patterns in which neighboring units are correlated are rewarded.
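As a concrete illustration of the stated configuration (six bidirectional layers, 512 hidden units per direction, no frame skipping), here is a minimal PyTorch sketch; the input feature dimension and the senone inventory size are placeholder assumptions, not values from the paper.

```python
import torch.nn as nn

class BLSTMAcousticModel(nn.Module):
    def __init__(self, feat_dim=40, num_senones=9000):
        super().__init__()
        # Six bidirectional layers, 512 hidden units per direction; without
        # frame skipping, every input frame yields an output posterior.
        self.blstm = nn.LSTM(feat_dim, 512, num_layers=6,
                             bidirectional=True, batch_first=True)
        # Map concatenated forward/backward states to senone scores.
        self.output = nn.Linear(2 * 512, num_senones)

    def forward(self, feats):
        # feats: (batch, frames, feat_dim)
        hidden, _ = self.blstm(feats)
        return self.output(hidden)  # per-frame senone logits
```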
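One way to read the spatial smoothing idea is as an auxiliary loss that penalizes disagreement between neighboring units on the imposed two-dimensional grid, thereby rewarding correlated neighbors. The sketch below follows that interpretation; it is our assumption, not the exact formulation of [23], and the grid shape (16 x 32 = 512 units) and loss weight are illustrative.

```python
import torch

def spatial_smoothing_loss(acts, grid=(16, 32)):
    """acts: (batch, frames, units), with units == grid[0] * grid[1].
    Penalizes squared differences between grid neighbors, which rewards
    activation patterns in which neighboring units are correlated."""
    h, w = grid
    a = acts.reshape(*acts.shape[:-1], h, w)
    dv = (a[..., 1:, :] - a[..., :-1, :]).pow(2).mean()  # vertical pairs
    dh = (a[..., :, 1:] - a[..., :, :-1]).pow(2).mean()  # horizontal pairs
    return dv + dh

# Combined with the primary objective using a small illustrative weight:
# loss = cross_entropy + 0.1 * spatial_smoothing_loss(layer_activations)
```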