
Convolutional Neural Networks: Critical but Challenging


Authors: Anthony Reina, Ravi Panchumarthy


Convolutional Neural Networks (CNNs) have fundamentally changed how we use computers to perform a number of critical real-world functions, such as analyzing medical images in search of tumors, processing satellite imagery in times of natural disaster, and much, much more.

But CNNs have a downside: large, high-resolution images require high-performance computer hardware equipped with large amounts of memory.

How do limitations in memory capacity affect the performance of CNNs? Researchers from Intel Corporation and the University of Pennsylvania have addressed that question—and generated important results. A recent study we published, entitled “Systematic Evaluation of Image Tiling Adverse Effects on Deep Learning Semantic Segmentation,” evaluated the pros and cons of specific Deep Learning tactics that are frequently used in image analysis to overcome common memory limitations.

CNN network models can be very effective in image classification (identifying objects in images), image-to-image translation (mapping an input image to an output image), and image segmentation (classifying pixels within an image according to identified objects). However, limitations in computer hardware (especially the memory available on deep learning accelerator cards such as GPUs) can make it difficult or impossible to process relatively large images at their original size and resolution. As a CNN processes an image, each intermediate layer produces activation maps that record the network's responses across regions of the image, and these maps can require several times the memory footprint of the original input image. In fact, activation maps can easily push the required memory to hundreds of gigabytes for large images.
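
To make the scale of the problem concrete, the rough back-of-the-envelope sketch below estimates the memory consumed by the encoder activation maps for a single 3D volume. The input size and channel widths are illustrative assumptions, not the configuration used in our study.

```python
# Back-of-the-envelope estimate of the memory consumed by encoder activation
# maps for one 3D input volume. The input size and channel widths below are
# illustrative assumptions only.

def activation_memory_gb(input_shape, channels_per_level, bytes_per_value=4):
    """Sum the memory of the feature maps produced at each encoder level.

    input_shape        -- spatial size of the input volume, e.g. (240, 240, 155)
    channels_per_level -- feature channels at each level (spatial dims halve per level)
    bytes_per_value    -- 4 bytes per value for float32 activations
    """
    d, h, w = input_shape
    total_bytes = 0
    for level, channels in enumerate(channels_per_level):
        scale = 2 ** level                       # each level downsamples by 2
        voxels = (d // scale) * (h // scale) * (w // scale)
        total_bytes += voxels * channels * bytes_per_value
    return total_bytes / 1e9

# Example: a BraTS-sized MRI volume with a hypothetical five-level encoder.
# The decoder, skip connections, gradients, and batch size each multiply
# this figure further during training.
print(activation_memory_gb((240, 240, 155), [32, 64, 128, 256, 512]))
```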

Lack of adequate memory can be addressed using a method known as “tiling.” In this procedure, the image to be analyzed is “cut up” into smaller (typically overlapping) rectangular tiles. Each of these tiles is then analyzed individually, so that only a small portion of the image needs to be held in memory at any one time. When all tiles have been analyzed, the results are stitched back together to create a fully processed version of the original image.
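
The sketch below illustrates this tiling-and-stitching idea in NumPy. The run_model placeholder, tile size, and overlap values are illustrative assumptions rather than the models and settings used in our study, and a production pipeline would also need to handle image edges and multi-class outputs.

```python
import numpy as np

# Minimal sketch of tiling and stitching: cut an image into overlapping tiles,
# run a segmentation model on each tile, then average the overlapping per-tile
# probabilities and round to a final mask.

def run_model(tile):
    # Placeholder: a real model would return per-pixel class probabilities.
    return np.full(tile.shape[:2], 0.5, dtype=np.float32)

def tiled_prediction(image, tile_size=128, overlap=32):
    h, w = image.shape[:2]
    stride = tile_size - overlap
    prob_sum = np.zeros((h, w), dtype=np.float32)
    counts = np.zeros((h, w), dtype=np.float32)

    # NOTE: this simple loop assumes the tile grid covers the image exactly;
    # a production pipeline would pad the image or adjust the last offsets.
    for y in range(0, h - tile_size + 1, stride):
        for x in range(0, w - tile_size + 1, stride):
            probs = run_model(image[y:y + tile_size, x:x + tile_size])
            prob_sum[y:y + tile_size, x:x + tile_size] += probs
            counts[y:y + tile_size, x:x + tile_size] += 1.0

    # Average the overlapping predictions, then round to a binary mask.
    return np.round(prob_sum / np.maximum(counts, 1.0))

mask = tiled_prediction(np.zeros((512, 512), dtype=np.float32))
```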

Tiling is a natural fit for the Deep Learning training process that CNNs require. In the machine learning world, training refers to an iterative process in which the network's successive layers learn to extract progressively higher-level features from raw data. For example, training can uncover “hidden” features in an image, such as the figure of a person set against a forest. Using tiling, training can be performed on the small, cut-up tiles, resulting in a model trained on tiles. The trained model is then used to infer results about the original image, one tile at a time.
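
For illustration only, a minimal training-on-tiles loop might look like the following sketch. The tiny two-layer network, random data, tile size, and stride are hypothetical placeholders, not the U-Net models or datasets from our study.

```python
import torch
import torch.nn as nn

# Toy sketch of training a segmentation network on overlapping image tiles
# rather than on the whole image. Everything here is a stand-in.

model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(8, 1, kernel_size=1),            # per-pixel logit
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

image = torch.rand(1, 1, 512, 512)              # stand-in for a large image
labels = (torch.rand(1, 1, 512, 512) > 0.5).float()

tile, stride = 128, 96                          # overlapping 128x128 tiles
for epoch in range(2):
    for y in range(0, 512 - tile + 1, stride):
        for x in range(0, 512 - tile + 1, stride):
            pred = model(image[:, :, y:y + tile, x:x + tile])
            loss = loss_fn(pred, labels[:, :, y:y + tile, x:x + tile])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```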

So far, so good. But the tiling process is not perfect. Our research demonstrated that tiling, when used in typical CNNs, introduces small but relevant differences during inferencing that can be detrimental to the overall success of the analysis. To explore this problem, we examined these differences in both medical and satellite images. Then we compared 2D and 3D segmentation models to see whether providing CNN models with wider image context could lead to more accurate, more consistent predictions.

Tiling Pros and Cons

The first step in our study was to crop the large original image at uniformly spaced offsets, resulting in a 2-dimensional grid (N x N) of tiles for flat images, or a 3-dimensional matrix (N x N x N) for 3D structures. The tiling process introduced additional model hyperparameters, including tile size, overlap amount, and aggregation process (e.g., tile averaging/rounding), that must be tuned to generate better predictions. Because this stage generates large amounts of data, processing is divided into phases. During one of these phases, the probabilities for overlapping tile predictions are averaged to produce a better Dice coefficient, a statistic used to gauge the similarity of two samples.
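
For reference, the Dice coefficient of two binary masks A and B is 2|A ∩ B| / (|A| + |B|). A minimal implementation (the standard formula, not code from our study) might look like this:

```python
import numpy as np

# Dice coefficient for two binary masks: 2 * |A intersect B| / (|A| + |B|).
def dice_coefficient(pred, truth, eps=1e-7):
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return (2.0 * intersection + eps) / (pred.sum() + truth.sum() + eps)

# Identical masks score 1.0; disjoint masks score approximately 0.0.
a = np.array([[1, 1, 0], [0, 1, 0]])
print(dice_coefficient(a, a))          # ~1.0
print(dice_coefficient(a, 1 - a))      # ~0.0
```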

A previous study looked at a 2D U-Net CNN model that was trained to detect glial tumors from brain Magnetic Resonance Imaging (MRI). Results indicated that analyzing an entire 2D, untiled image produces better predictions than the tiling approach. Our study extended that investigation by systematically (1) evaluating the resulting effects in both medical and non-medical data, (2) comparing both 2D and 3D U-Net models, and (3) reviewing whether these differences were caused by operations within the CNN model that vary with translations of the model input. Finally, our study showed that these issues can be partially addressed by increasing the size of the tile—up to and including training and inferring on the whole image.

We used two different sets of data in our research:

Dataset 1: The medical data used was based on the publicly available training dataset of the International Brain Tumor Segmentation (BraTS) challenge 2019.

FIGURE 1. Example of a 3D input multi-parametric Magnetic Resonance Imaging scan from the International Brain Tumor Segmentation (BraTS) challenge.

Dataset 2: The non-medical data was sourced from the public SpaceNet satellite imagery dataset suite. For this dataset, the model takes a single satellite image from the SpaceNet-Vegas collection as input.

FIGURE 2. An example of the SpaceNet-Vegas images used in our study. The ground truth annotations for buildings and other structures were professionally labeled.

Results: The Cost of Tiling

We systematically evaluated the effects of using tiling approaches vs. using the whole image for deep learning semantic segmentation, in both 2D and 3D configurations. Our results revealed substantial differences in the 2D U-Net architecture, both for the medical and the satellite data. Specifically, evaluations revealed improved results when running inference on the original, full-size 2D image, as compared with inference on smaller image tiles. Furthermore, gradually increasing the tile size produced a gradual improvement in results.

In regard to tiling in general, our research raised awareness of several critical issues:

  1. Tiling hyperparameters, which include tile size, offset, orientation, and overlap, can cause large variations in the prediction. This variance is not just limited to a translation less than the stride (i.e., the number of pixels the tile is shifted), but seems to be present even with translations of ±2 pixels in each direction (a simple probe for this effect is sketched after this list).
  2. Zero padding in the convolutional layers introduces translational variance into the topology.
  3. Methods that aggregate the individual predictions into a whole image prediction—such as averaging the predicted outcome pseudo-probability maps and rounding these predictions—can have a significant effect on the overall accuracy.
  4. Larger degrees of image context, including adding 3D information to the model and using larger tile sizes, improve model results in training and make the model less sensitive to hyperparameters during inference.
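
As a rough illustration of how the translational variance noted in item 1 can be probed, the sketch below shifts an input by a few pixels, realigns the resulting prediction, and compares it with the unshifted prediction. The model and dice arguments are placeholders for whatever segmentation model and similarity metric are already in use.

```python
import numpy as np

# Probe for translational variance: shift the input by (dy, dx) pixels, run
# the same model, shift the prediction back, and compare with the unshifted
# prediction. A perfectly translation-invariant model would return Dice = 1.0
# for every small shift.

def shifted_prediction_dice(model, dice, image, dy, dx):
    baseline = model(image)
    shifted_input = np.roll(image, shift=(dy, dx), axis=(0, 1))
    shifted_pred = model(shifted_input)
    # Undo the shift so the two predictions are spatially aligned.
    # np.roll wraps pixels around the border; for a stricter test, crop the
    # borders of both predictions before comparing.
    realigned = np.roll(shifted_pred, shift=(-dy, -dx), axis=(0, 1))
    return dice(baseline, realigned)
```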

A Clear Conclusion: Where Possible, Avoid Tiling

Tiling methods, while sometimes necessary due to constraints in memory, often reduce accuracy. To create accurate, robust models, analysts need increased access to memory—either through improvements in hardware or through advances in high-performance computing techniques. Where sufficiently large memory is available for training and inferencing, tiling methods are neither necessary nor desirable. Hence, tiling should be reserved for only those cases where the physical limitations of memory make it an absolute necessity.

Further, when tiling must be used, researchers should be careful to investigate how the translational variance of the model affects predictions, and they should compare methods of tiling aggregation to determine the best ways to mitigate the variability inherent in tiling. For more information, read our paper, “Systematic Evaluation of Image Tiling Adverse Effects on Deep Learning Semantic Segmentation,” published in February.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com. Intel and the Intel logo are trademarks of Intel Corporation or its subsidiaries in the U.S. and/or other countries.

Other names and brands may be claimed as the property of others.

© Intel Corporation

