This JAMA Guide to Statistics and Methods explains the basic concepts underlying convolutional neural networks (CNNs), a type of machine learning being used to automate the reading of medical images.
Neural networks, a subclass of methods in the broader field of machine learning, are highly effective at enabling computer systems to analyze data, which can facilitate the work of clinicians. Neural networks have been used since the 1980s, with convolutional neural networks (CNNs) applied to images beginning in the 1990s.1-3 Example applications include identifying natural images of everyday life,4 classifying retinal pathology,5 selecting cellular elements on pathological slides,6 and correctly identifying the spatial orientation of chest radiographs.7 Successful neural networks for such tasks are typically composed of multiple analysis layers; the term deep learning is used synonymously to describe this class of neural networks.
OPENING THE DEEP LEARNING BLACK BOX
One way to understand how CNNs work is to use an analogy with written language. Ideas are communicated in written articles composed of a series of paragraphs, which are in turn made of sentences; sentences are made of words, and words of letters. Understanding text comes from assessing the relationships of the letters to one another in increasing layers of complexity (a “deep” hierarchical representation: from letters, to words, to sentences, to paragraphs). Computers analyze images via motifs rather than letters. A motif is a collection of pixels that forms a basic unit of analysis; the simplest motifs represent the most basic patterns for communicating visual information, just as a letter does for language. After the computer learns the form of these motifs, it detects them in images using filters, each matched to the structure of one motif.
Consider the image in the accompanying Video, which depicts a collection of words. An image may be considered a map, with the pixel value at each location reflecting the signal strength at that point; the collection of pixels yields an image, and in this example the image forms a set of words.
The most primitive building blocks that make up the images are represented in the first layer of the CNN model; these building blocks correspond to the motifs. The CNN detects these motifs by applying filters to the images. Each filter is a set of pixels similar in form to its respective motif. In this example, the first-layer filters correspond to the letters of the alphabet. Each filter is shifted sequentially to each location in the image and measures the degree to which the local properties of the image match the ...