Technical Approach

Robustly classifying ground infrastructure such as roads and street crossings is an essential task for mobile robots operating alongside pedestrians. While many semantic segmentation datasets are available for autonomous vehicles, models trained on such datasets exhibit a large domain gap when deployed on robots operating in pedestrian spaces. Manually annotating images recorded from pedestrian viewpoints is both expensive and time-consuming. To overcome this challenge, we propose TrackletMapper, a framework for annotating ground surface types such as sidewalks, roads, and street crossings from object tracklets without requiring human-annotated data. To this end, we project the robot ego-trajectory and the paths of other traffic participants into the ego-view camera images, creating sparse semantic annotations for multiple types of ground surfaces from which a ground segmentation model can be trained. We further show that the model can be self-distilled for additional performance benefits by aggregating a ground surface map and projecting it into the camera images, creating a denser set of training annotations compared to the sparse tracklet annotations. We qualitatively and quantitatively demonstrate the benefits of our approach on a novel large-scale dataset for mobile robots operating in pedestrian areas.

Overview of the System
Visualization of our automatic annotation pipeline. In step I, we leverage RGB images, LiDAR point clouds, ego-poses, and an object tracker to project the ego-poses and the observed tracklets into camera images, generating a sparsely-annotated semantic segmentation dataset $\mathcal{D}_0$. In step II, we use a frozen segmentation model trained on $\mathcal{D}_0$ to obtain semantic annotations and aggregate them into a global semantic surface map. Finally, we project this map into the camera images and obtain denser and more consistent annotations, denoted as $\mathcal{D}_1$.
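The core geometric operation in step I is projecting 3D trajectory points (the robot's ego-poses and tracked traffic participants) into the camera image. The sketch below illustrates this with a standard pinhole projection; the function name, pose convention (camera-to-world), and intrinsics are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def project_points(points_world, T_world_cam, K):
    """Project 3D world points into the image plane of a pinhole camera.

    points_world: (N, 3) points, e.g., tracklet footprints on the ground.
    T_world_cam:  (4, 4) camera pose in the world frame (camera-to-world).
    K:            (3, 3) camera intrinsic matrix.
    Returns (M, 2) pixel coordinates of the points in front of the camera.
    """
    # Transform world points into the camera frame (world-to-camera).
    T_cam_world = np.linalg.inv(T_world_cam)
    pts_h = np.hstack([points_world, np.ones((len(points_world), 1))])
    pts_cam = (T_cam_world @ pts_h.T).T[:, :3]

    # Keep only points in front of the camera (positive depth).
    pts_cam = pts_cam[pts_cam[:, 2] > 0.1]

    # Perspective projection with the intrinsics, then dehomogenize.
    uv = (K @ pts_cam.T).T
    return uv[:, :2] / uv[:, 2:3]
```

Rasterizing the projected trajectory points (e.g., as dilated footprints labeled by the surface class they traversed) then yields the sparse annotations of $\mathcal{D}_0$.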
Example maps produced with our approach. The top row shows the respective ground-truth map and the bottom row shows the aligned semantic surface maps obtained with our aggregation approach. Black denotes unannotated or unobserved areas.
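The map aggregation in step II can be sketched as a 2D grid that accumulates per-class votes from many projected observations and labels each cell by majority. This is a minimal illustration under assumed choices (map size, resolution, vote-based fusion); the actual aggregation details may differ.

```python
import numpy as np

class SemanticSurfaceMap:
    """Accumulate per-point semantic labels into a 2D grid map.

    Each cell keeps a vote count per class; the map label is the argmax
    over the accumulated votes. Size and resolution are illustrative.
    """

    def __init__(self, size_m=100.0, resolution=0.1, num_classes=4):
        n = int(size_m / resolution)
        self.resolution = resolution
        self.origin = size_m / 2.0  # center the grid on the world origin
        self.counts = np.zeros((n, n, num_classes), dtype=np.int32)

    def add_observations(self, points_xy, labels):
        """Add labeled ground points (world x/y coordinates) to the map."""
        ij = ((points_xy + self.origin) / self.resolution).astype(int)
        valid = (ij >= 0).all(axis=1) & (ij < self.counts.shape[0]).all(axis=1)
        for (i, j), c in zip(ij[valid], labels[valid]):
            self.counts[i, j, c] += 1

    def labels(self):
        """Return the per-cell majority label; -1 marks unobserved cells."""
        observed = self.counts.sum(axis=-1) > 0
        lab = np.full(self.counts.shape[:2], -1, dtype=np.int32)
        lab[observed] = self.counts[observed].argmax(axis=-1)
        return lab
```

Projecting the labeled grid back into the camera images (the inverse of the step-I projection) produces the denser annotations of $\mathcal{D}_1$.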

Freiburg Pedestrian Scenes Dataset


We present the Freiburg Pedestrian Scenes dataset recorded with our robot platform. During each data collection run, the robot is teleoperated through semi-structured urban environments and moves alongside pedestrians on sidewalks, pedestrian areas, and street crossings.

Each data collection run consists of time-synchronized sensor measurements from a Bumblebee Stereo RGB camera, a Velodyne HDL 32-beam rotating LiDAR scanner, an IMU, and a GPS/GNSS receiver. Furthermore, we provide Graph-SLAM poses. In total, the dataset comprises 15 highly diverse and challenging urban scenes. The data collection runs cover a wide range of illumination conditions, weather conditions, and structural diversity.

The dataset was recorded over the course of multiple years in the city of Freiburg, Germany. To evaluate our approach, we manually annotated 50 ego-view RGB images from five data collection runs not included in the training set. We also hand-annotated large sections of the traversed areas with a semantic BEV map, enabling a direct comparison between aggregated and ground-truth maps.
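Comparing an aggregated map against the hand-annotated BEV ground truth amounts to a per-class intersection-over-union over the grid cells, skipping cells that were never annotated. A minimal sketch of such a metric (assuming -1 marks unannotated cells):

```python
import numpy as np

def map_iou(pred, gt, num_classes, ignore=-1):
    """Per-class IoU between an aggregated map and a ground-truth map.

    pred, gt: 2D integer label grids of the same shape.
    Cells labeled `ignore` in the ground truth are excluded. Classes
    absent from both maps yield NaN rather than a misleading score.
    """
    valid = gt != ignore
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = (p | g).sum()
        ious.append((p & g).sum() / union if union > 0 else float("nan"))
    return ious
```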
Visualization of the map-referenced manual ground truth surface annotations.


Please cite our work if you use the Freiburg Pedestrian Scenes dataset or report results based on it.

@inproceedings{zuern2022trackletmapper,
  title={TrackletMapper: Ground Surface Segmentation and Mapping from Traffic Participant Trajectories},
  author={Z{\"u}rn, Jannik and Weber, Sebastian and Burgard, Wolfram},
  booktitle={6th Annual Conference on Robot Learning},
  year={2022}
}