The RATP Smart Systems Datalab, data serving your daily life

Transport mode recognition using geolocation data

Written by Asma | Sep 5, 2023 3:00:45 PM

One of our challenges at RATP Smart Systems is to build a MaaS (Mobility as a Service) platform. Mobility as a Service is the integration of various forms of transport services into a single application. To meet a customer’s request, a MaaS operator presents a range of transport options, be they public transport, ride-, car- or bike-sharing, scooter, kick-scooter, walking, taxi or car rental/lease/pooling, or a combination thereof. For users, MaaS adds value by providing access to mobility through a single application with a single payment channel, instead of multiple ticketing and payment operations. It should offer them the best value proposition by helping them meet their mobility needs and making multimodal journeys easy to handle.

A key point is that the design of MaaS applications has to be data-driven. You can only offer a user the best value proposition if you personalize the results according to their inferred preferences.

Our first use case consists of identifying an individual’s transport modes (bus, car, rail, static, walk) from their geolocation data. Our aim is to identify the transport modes used by each individual throughout the day, the moments of transition between modes and the time spent in each mode. For this use case, we use an approach based on stacking and ensembling.
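To make the target output concrete, here is a minimal sketch of how per-point mode labels could be turned into segments, transitions and time spent per mode. The function name `summarize_modes`, the input layout (a time-sorted list of `(timestamp_seconds, mode)` pairs) and the sample values are assumptions made for illustration, not the code we run in production.

```python
from itertools import groupby


def summarize_modes(points):
    """Summarize a labelled trajectory.

    points: list of (timestamp_seconds, mode) pairs, sorted by time.
    Returns (segments, transitions, time_per_mode).
    """
    # Group consecutive points that share the same mode into one segment.
    segments = []
    for mode, group in groupby(points, key=lambda p: p[1]):
        group = list(group)
        segments.append({"mode": mode, "start": group[0][0], "end": group[-1][0]})

    # A transition lies between the last point of one segment
    # and the first point of the next one.
    transitions = [
        (prev["end"], nxt["start"], prev["mode"], nxt["mode"])
        for prev, nxt in zip(segments, segments[1:])
    ]

    # Time spent per mode, summed over all segments of that mode.
    time_per_mode = {}
    for seg in segments:
        duration = seg["end"] - seg["start"]
        time_per_mode[seg["mode"]] = time_per_mode.get(seg["mode"], 0) + duration

    return segments, transitions, time_per_mode


# Illustrative labels, one point every 2 seconds.
sample = [(0, "walk"), (2, "walk"), (4, "walk"),
          (6, "bus"), (8, "bus"), (10, "bus"),
          (12, "walk"), (14, "walk")]
segments, transitions, time_per_mode = summarize_modes(sample)
```

In this sketch the gap between two segments is not attributed to either mode; how to assign it is a design choice that depends on the sampling rate.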

In practice, we use the daily mobility data of each individual and the specificities of each transport mode (road, bus and rail networks). The daily mobility data consist of geolocated points (timestamp, latitude, longitude) recorded approximately every 2 seconds for each individual (Figure 1).

From these raw data, we extract many characteristics of the movements (speed, bearing, acceleration, distances to networks, etc.).
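As an illustration of this extraction step, the sketch below derives speed, bearing and acceleration from consecutive points using the haversine distance and the standard initial-bearing formula. The function names, the input layout (`(t_seconds, lat, lon)` tuples) and the output field names are assumptions chosen for the example; our actual feature code may differ.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlmb = math.radians(lon2 - lon1)
    y = math.sin(dlmb) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlmb)
    return math.degrees(math.atan2(y, x)) % 360.0


def motion_features(points):
    """points: list of (t_seconds, lat, lon) sorted by time.

    Returns one feature dict per pair of consecutive points.
    """
    feats = []
    prev_speed = None
    for (t0, la0, lo0), (t1, la1, lo1) in zip(points, points[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicated or out-of-order timestamps
        speed = haversine_m(la0, lo0, la1, lo1) / dt
        # Acceleration needs two speed values, so it is undefined for the first pair.
        accel = None if prev_speed is None else (speed - prev_speed) / dt
        feats.append({
            "t": t1,
            "speed_mps": speed,
            "bearing_deg": bearing_deg(la0, lo0, la1, lo1),
            "accel_mps2": accel,
        })
        prev_speed = speed
    return feats
```

With points recorded roughly every 2 seconds, raw speeds computed this way are sensitive to GPS jitter, which is one reason to complement them with aggregated features.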

Figure 1: Geolocated points of a path to and from Paris.

1. The dataset

We approach this as a supervised machine learning classification problem. The labelled dataset is composed of around 200 trajectories spread over 8 days. The raw dataset is far larger, but we are still struggling to label it correctly. These trajectories are a very small subset that we labelled manually.

We distinguish car, walk, bus and rail. We add another class, “Static”, for when someone stays several hours in the same location. Static sequences are often noisy: when the phone is indoors, GPS reception can be faulty and the inferred positions spurious.

2. Feature engineering

As usual with machine learning projects, one of the first steps is feature engineering and data transformation. We enrich the labelled data with approximately 180 features characterizing the movement, all derived from the timestamp, latitude and longitude. These are mainly aggregate features, rolling-window features, and various speed and bearing features.
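A rolling-window feature can be sketched as follows: for each point, aggregate the speeds of the last few points. The function name `rolling_features`, the window size and the chosen aggregates (mean, standard deviation, maximum) are assumptions for the example and not the exact definition of our 180 features.

```python
from collections import deque
from statistics import mean, pstdev


def rolling_features(speeds, window=5):
    """Trailing-window aggregates over a sequence of speeds.

    For each position, the window holds the current value and up to
    `window - 1` previous values.
    """
    buf = deque(maxlen=window)  # oldest values are dropped automatically
    rows = []
    for s in speeds:
        buf.append(s)
        rows.append({
            "mean": mean(buf),
            "std": pstdev(buf),  # population standard deviation, 0.0 for a single value
            "max": max(buf),
        })
    return rows
```

For example, `rolling_features([1, 2, 3, 4], window=2)` yields a mean of 1.5 at the second position and 3.5 at the last one. Repeating this over several window sizes and several base signals (speed, bearing, acceleration) is one way a feature set can grow to well over a hundred columns.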

3. Network data

A key point is that someone travelling by car must follow a road, while someone travelling by metro must follow a railway. We add this information using external datasets (i.e. the topology of the various transport networks).

In order to do this, we use data from OpenStreetMap for streets and roads of a given region. For transport networks (both bus and rail), we use open data given by public transport network operators.
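One simple network feature is the distance from a geolocated point to the nearest segment of a network. The sketch below assumes the network is available as a list of polylines, each a list of `(lat, lon)` vertices, and uses a local flat-earth projection that is only accurate over short distances. The function names and data layout are hypothetical, and this does not reproduce how the probabilities shown in the figures are computed.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def to_local_xy(lat, lon, lat0, lon0):
    """Equirectangular projection around (lat0, lon0), in metres."""
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to segment [a, b] in the plane."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    seg_len2 = dx * dx + dy * dy
    if seg_len2 == 0.0:
        return math.hypot(px - ax, py - ay)  # degenerate segment
    # Project p onto the segment and clamp to its endpoints.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / seg_len2))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def distance_to_network_m(lat, lon, polylines):
    """Distance in metres from (lat, lon) to the closest polyline segment."""
    best = math.inf
    for line in polylines:
        # Project vertices into a plane centred on the query point.
        xy = [to_local_xy(la, lo, lat, lon) for la, lo in line]
        for a, b in zip(xy, xy[1:]):
            best = min(best, point_segment_distance((0.0, 0.0), a, b))
    return best
```

This brute-force scan over every segment is only meant to show the geometry; on a regional road network, a spatial index would be needed to keep the computation tractable.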

Figures 3 & 4: Car and rail network features, respectively, calculated for the example path. The size of a geolocated point indicates the probability that this point is on the considered network.

4. Machine learning models

Prevision.io is a French start-up whose platform manages the whole AI model lifecycle. It offers a collection of supervised algorithms, ranging from basic logistic regression to some of the latest boosting algorithms such as XGBoost, CatBoost and LightGBM. This platform simplifies the life of data scientists.

You just feed in the dataset for training and validation and select the metric; the platform then:

  • constructs the features associated with it
  • chooses the best algorithm supported (tree, regression, boosting, perceptron, …)
  • chooses the best parameters for the algorithms

For this use case, a classification problem, we chose to optimize the multinomial log-loss. After benchmarking the algorithms, the platform selected XGBoost, a popular boosting library, as the best for the purpose. The model is trained on the training subset and then validated on the test subset. For each point of a trajectory, it estimates the probability of belonging to each of the 5 modes (bus, car, rail, static, walk).
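For reference, the multinomial log-loss is the average negative log of the probability assigned to the true class. The sketch below computes it from per-point probabilities over the 5 modes; the function name, the class ordering in `MODES` and the clipping constant `eps` are assumptions for the example.

```python
import math

# Assumed class order for the probability vectors.
MODES = ["bus", "car", "rail", "static", "walk"]


def multinomial_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability of the true class.

    y_true: list of class indices into MODES.
    y_prob: list of per-class probability lists, one per point.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) when a model is fully confident and wrong.
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)
```

A model that outputs a uniform 0.2 for every mode scores log(5), about 1.609, so any useful model should score well below that value.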

At this step we have predicted transport modes, but visualizing the predictions on a map reveals a common pattern of misclassification: for example, Figure 5 shows some isolated points predicted as “walk” but located between two continuous segments of car.

To solve this misclassification problem, we apply a second algorithm, a neural network, which smooths XGBoost’s failed predictions. We train a Recurrent Neural Network on the probabilities predicted by the first XGBoost model. With this method, commonly known as stacking (see Data Science competitions on Kaggle), we correct some of the points misclassified by XGBoost. Figure 6 shows how the XGBoost predictions were corrected by the Recurrent Neural Network.
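To illustrate why a second stage that looks at neighbouring points helps, the sketch below replaces the Recurrent Neural Network with a much simpler stand-in: a centred moving average over the first-stage probabilities, followed by an argmax. This is not our RNN, only a toy showing how context removes an isolated “walk” point inside a car segment. The function names, the window size and the probability values are made up for the example.

```python
# Assumed class order for the probability vectors.
MODES = ["bus", "car", "rail", "static", "walk"]


def smooth_probabilities(probs, half_window=2):
    """Centred moving average of per-point class probabilities."""
    n = len(probs)
    n_classes = len(probs[0])
    smoothed = []
    for i in range(n):
        # The window is truncated at the ends of the trajectory.
        lo, hi = max(0, i - half_window), min(n, i + half_window + 1)
        window = probs[lo:hi]
        smoothed.append([sum(row[k] for row in window) / len(window)
                         for k in range(n_classes)])
    return smoothed


def predict_modes(probs):
    """Most probable mode for each point."""
    return [MODES[max(range(len(row)), key=row.__getitem__)] for row in probs]


# Illustrative first-stage output: a car segment with one isolated "walk" point.
car = [0.05, 0.80, 0.05, 0.00, 0.10]
blip = [0.05, 0.35, 0.00, 0.00, 0.60]
first_stage = [car] * 3 + [blip] + [car] * 3

raw_modes = predict_modes(first_stage)
smoothed_modes = predict_modes(smooth_probabilities(first_stage))
```

Here `raw_modes` contains one “walk” in the middle, while `smoothed_modes` is “car” throughout. A fixed-width average would also blur genuine short segments and real transitions, which is one motivation for learning the second stage from labelled sequences instead.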

Figure 5: Results of transport mode recognition on the example path. Blue and sky-blue colors represent the “walk” and “car” modes, respectively. The figure shows a common pattern of misclassification: some isolated points predicted as “walk” but located between two continuous segments of car.

Figure 6: How the XGBoost prediction model was corrected by the Recurrent Neural Network.

5. Conclusion

Automating transport mode recognition is an ongoing challenge, but the potential benefits justify the work invested to tackle it, and we are pleased with the results obtained so far.

© Nikita Loukachev, Asma Ben Said, Remy Reche