PanopticBEV Examples

A Neural Network-Based Approach to Generate Bird's-Eye-View Panoptic Segmentation Maps Using Frontal-View Monocular Images

These examples demonstrate the performance of our PanopticBEV model on the KITTI-360 and nuScenes datasets. PanopticBEV is the first end-to-end learning approach for directly generating dense panoptic segmentation maps in the bird's eye view given monocular images in the frontal view. To learn more about our work, please see the Technical Approach section. View the demo by selecting a dataset from the drop down box below and click on an image to view the predicted BEV panoptic segmentation map.

If you are unable to view the images for the KITTI-360 and nuScenes datasets below, please disable your AdBlocker and refresh the page.

Technical Approach

What are Bird's-Eye-View Maps?

Bird's-Eye-View (BEV) maps portray the scene around an autonomous vehicle as if it were viewed by a bird overflying the autonomous vehicle. Such representations provide rich spatial context and are extremely easy to interpret and process. They also capture absolute distances in the metric scale which allow them to be readily deployed in applications such as trajectory generation, path planning, and behaviour prediction, where a holistic view around an autonomous vehicle as well as the distance to nearby obstacles is essential.

Traditionally, BEV maps are created using complex multi-stage paradigms that encapsulate a series of distinct tasks such as depth estimation, semantic segmentation, and ground plane estimation. These tasks are often learned in a disjoint manner which prevents the model from holistic reasoning and results in erroneous BEV maps. Further, existing algorithms only predict semantics in the BEV space, which restricts their use in applications where the knowledge of object instances is crucial. To address these aforementioned challenges, we present the first end-to-end learning approach to directly generate a dense panoptic segmentation map in the BEV, given a monocular image in the frontal view (FV). We incorporate several key advances such as the dense transformer module and the sensitivity-based weighting function into our PanopticBEV architecture, which improves its performance as well as its efficiency.

PanopticBEV Architecture

Figure: Topology of the proposed PanopticBEV model. The model consists of a modified EfficientDet backbone (in grey) that generates four feature maps with strides 4, 8, 16, and 32. Each feature stride is then independently transformed into the BEV by our novel dense transformer model (in green). The transformed BEV features are then fed into the semantic (in orange) and instance (in violet) heads, followed by an adaptive panoptic fusion module to generate the final panoptic segmentation map in the BEV.

The goal of our PanopticBEV architecture is to generate a coherent panoptic segmentation mask in the BEV given a single monocular image in the FV. Our PanopticBEV model follows the top-down paradigm and is composed of a shared backbone, a dense transformer module, a semantic head, an instance head, and a panoptic fusion module. A custom variant of the EfficientDet-D3 backbone, adapted to work on larger feature maps, forms the backbone of our network and outputs feature maps of four scales $$\mathcal{E}_{4}, \mathcal{E}_{8}, \mathcal{E}_{16}, \mathcal{E}_{32}$$. Each feature map is then transformed into the BEV using our dense transformer module, which consists of two distinct transformers to independently map the vertical and flat regions in the input FV image to the BEV. The transformed feature maps are then consumed by the EfficientPS-based semantic head and Mask-RCNN-based instance head to parallelly generate the semantic logits and instance masks. The semantics logits and instance masks are subsequently fused in the panoptic fusion module to generate the final BEV panoptic segmentation output.

Dense Transformer Module

Figure: Topology of our transformer module containing distinct vertical and flat transformers.

The design of our dense transformer module is based on the principle of how different regions in the 3D world are projected onto a perspective 2D image. Specifically, a column belonging to a flat region in the FV image projects onto a perspectively distorted area in the BEV, whereas a column belonging to a vertical non-flat region maps to an orthographic projection of a volumetric region in the BEV space. Accordingly, we employ two distinct transformers in our dense transformer module to independently map features from the vertical and flat regions in the FV to the BEV. Reflecting our observation, the vertical transformer expands the FV features into a volumetric lattice to model the intermediate 3D space before flattening it along the height dimension to generate the vertical BEV features. Parallelly, the flat transformer uses the IPM algorithm followed by our Error Correction Module to map the flat FV features into the BEV. The transformed vertical and flat features are then merged in the BEV space to generate the composite BEV feature map.

Sensitivity-Based Weighting

Figure: Plot of the sensitivity-based weight ($$w_{sens}$$) as a function of the position in the BEV space. The ego vehicle, depicted by the car, is located at the bottom of the image.

The sensitivity-based weighting function accounts for the varying levels of descriptiveness across the input FV image by intelligently weighting pixels in the BEV space. Owing to the perspective projection, the apparent motion in the input FV image for close regions is much larger than that of farther ones when a point in the 3D world is moved by a unit distance. This disparity makes differentiating between small changes in distance much more difficult for farther regions as compared to closer ones, resulting in high uncertainty in the distant regions. We address this problem by proposing a sensitivity-based weighting function to up-weight pixels belonging to far-away regions. This up-weighting allows the network to focus on the distant regions which improves the overall model accuracy. The sensitivity-based weighting function is defined as $S = \frac{\sqrt{f_x^2 z^2 + (f_x x + f_y y)^2}}{z^2} \qquad w_{sens} = 1 + \frac{1}{log(1 + \lambda_s S)}$ where $$f_x, f_y$$ denote the focal length of the camera in terms of pixels, $$x, y, z$$ represent the 3D position of the point of interest, and $$\lambda_s$$ is a user-defined constant.

Bird's-Eye-View Panoptic Segmentation Datasets

We introduce two new BEV panoptic segmentation datasets for autonomous driving, namely, KITTI-360 PanopticBEV and nuScenes PanopticBEV that provides panoptic BEV ground truths for the KITTI-360 and nuScenes datasets respectively. KITTI-360 PanopticBEV consists of 9 scenes with a total of 64071 FV-BEV image pairs, while the nuScenes PanopticBEV dataset consists of a total of 850 scenes with a total of 34149 FV-BEV image pairs. We provide annotations for 7 'stuff' and 4 'thing' classes for the KITTI-360 PanopticBEV dataset, and annotations for 6 'stuff' and 4 'thing' classes for the nuScenes PanopticBEV dataset. We also publicly release the code used to generate the BEV datasets to promote FV-BEV image pair creation for other large-scale urban datasets.

The data is provided for non-commercial use only. By downloading the data, you accept the license agreement which can be downloaded here. If you report results based on the KITTI-360 PanopticBEV or nuScenes PanopticBEV datasets, please consider citing the paper mentioned in the Publications section.

Code

A software implementation of this project based on PyTorch can be found in our GitHub repository for academic usage and is released under the GPLv3 license. For any commercial purpose, please contact the authors.