Jetson - Self-Driving Toy Car (Part: 2)

Computer Vision Sensing & CNN Perception Improvements 🤖🚗

In today’s article, we are going to improve Jetson’s sensing and perception abilities with Computer Vision and Machine Learning techniques. This will allow our toy car to handle new cases that go far beyond simple path following. By the end of this article, you’ll see how a self-driving toy car can learn to take correct turns at crossroads, follow the drivable path, and stop when the road ends.

If you haven’t already checked the first part of the series, please take the time to do it now:

Also, feel free to check the corresponding codebase and follow along:

Sensing Improvements

The robustness of a robotic system often depends heavily on its sensing capabilities. In other words, the better a robot senses its surrounding environment, the better it can potentially act upon it.

It’s no different in our self-driving toy car project, and we are going to focus on our most important sensor - the camera.


Currently, the front-facing camera is Jetson’s only source of information about its environment. In Part 1 of the series, it was a 160° wide-angle camera. It was getting the job done, but we could do better if our robot could see more. One of the possible approaches to that problem could be to add more cameras.

This is, for example, what the sensor suite looks like on Tesla vehicles.


Trying to imitate the above sensor suite would definitely be overkill for our toy car project, but it doesn’t mean we couldn’t improve our initial camera setup.

Instead of adding more cameras, which would add processing overhead for our (already very busy) computer, we can replace our front-facing camera with one that has a wider field of view: 200° instead of 160°.

Apart from that, we can also slightly alter the camera placement so that Jetson sees more of the road and less of the surroundings, which are usually more of a distraction than a valuable source of information for the driving task.

Here is how the captured frames compare:

160° FOV (source: author)
200° FOV (source: author)

We can see that the second frame is a superset of the first and, most importantly, shows the area just in front of the bumper and to its sides. Having more information about the environment packed into the same number of pixels (224x224x3) should improve the perception stage that follows.

This is what Jetson looks like now, with a new camera and a new chassis.
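To make the "more information in the same number of pixels" point concrete, here is a rough back-of-the-envelope sketch. It assumes, as a simplification, that the horizontal field of view is spread evenly across the 224-pixel image width; real wide-angle lenses only approximate this, and the function name is made up for illustration.

```python
# Rough angular coverage per pixel column, assuming (as a simplification)
# that the field of view is spread evenly across the image width.
IMAGE_WIDTH = 224  # network input width mentioned in the article


def degrees_per_pixel(fov_degrees: float, width: int = IMAGE_WIDTH) -> float:
    """Return how many degrees of the scene one pixel column covers."""
    return fov_degrees / width


old = degrees_per_pixel(160)  # previous camera
new = degrees_per_pixel(200)  # new camera
ratio = new / old

print(f"160 deg camera: {old:.3f} deg/px")
print(f"200 deg camera: {new:.3f} deg/px")
print(f"each pixel column now covers {ratio:.2f}x the angle")
```

Under this assumption, the wider lens trades a little angular detail per pixel for a noticeably larger view of the road.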

(source: author)

Perception Improvements

Having more valuable pixels flowing into the system should already improve its performance, but it doesn’t mean we should stop there. Our end-to-end vision system is still in the very early stage of development and there is still potential to make it more robust.


The core of the visual perception system is a Convolutional Neural Network with the resnet18 backbone.


Final Fully Connected Layers

For the simple path following we did in the previous part of the series, resnet18’s capabilities were enough to handle the corresponding relatively simple dataset.

However, we are taking one step forward and extending our dataset with additional, more complex scenarios. For the network to be able to handle them, we need to add more layers and neurons.

Inspired by Nvidia’s end-to-end network, and specifically by the stack of fully connected layers after its convolutional backbone, we are going to add two additional fully connected layers at the end of the network.

model = torchvision.models.resnet18(pretrained=pretrained)
# `model` is a placeholder name; `fc` is resnet18's final fully connected layer
model.fc = torch.nn.Sequential(
    torch.nn.Linear(in_features=model.fc.in_features, out_features=128),
    torch.nn.Linear(in_features=128, out_features=64),
    torch.nn.Linear(in_features=64, out_features=OUTPUT_SIZE)
)

Adding more layers, and thus increasing the size of the network, is a double-edged sword. On the one hand, it allows the network to learn more complex patterns. On the other hand, if the network is too big, it may overfit: instead of learning the patterns and generalizing, it learns the ‘correct answers’ for the training set.


One possible solution to the overfitting problem is Dropout. Dropout is a regularization technique that, during training, randomly drops neurons (together with their connections) with a given probability. It forces the network to spread the learned information across multiple neurons instead of relying on single ones, which helps prevent the network from simply ‘memorizing’ the training dataset.

(source: Srivastava, Nitish, et al. “Dropout: a simple way to prevent neural networks from overfitting”, JMLR 2014)

‘Can’t rely on any one feature, so have to spread out the weights’ Andrew Ng
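The mechanism can be illustrated with a few lines of plain Python. This is a minimal sketch of "inverted" dropout, the variant that rescales the surviving values so their expected sum stays the same (this is also what torch.nn.Dropout does). The function name and the probability of 0.5 are chosen only for illustration.

```python
import random


def dropout(activations, p, training=True, rng=random):
    """Inverted dropout: zero each value with probability p and
    rescale the survivors by 1 / (1 - p). Does nothing at inference time."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    if not training or p == 0.0:
        return list(activations)
    keep = 1.0 - p
    return [a / keep if rng.random() >= p else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [1.0] * 10

train_out = dropout(activations, p=0.5, training=True, rng=rng)
eval_out = dropout(activations, p=0.5, training=False, rng=rng)

print("training :", train_out)   # a mix of 0.0 and 2.0
print("inference:", eval_out)    # unchanged
```

Because a different random subset is silenced on every training step, no single neuron can carry a pattern on its own, which is the intuition behind the quote above.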

Finally, our network looks like this:

model = torchvision.models.resnet18(pretrained=pretrained)
# `model` is a placeholder name; `fc` is resnet18's final fully connected layer
model.fc = torch.nn.Sequential(
    torch.nn.Linear(in_features=model.fc.in_features, out_features=128),
    torch.nn.Linear(in_features=128, out_features=64),
    torch.nn.Linear(in_features=64, out_features=OUTPUT_SIZE)
)
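For readers who want to experiment with Dropout in such a head, here is one possible arrangement. It is only a sketch: placing a Dropout layer before each Linear layer, the probability DROPOUT_PROB = 0.5, and OUTPUT_SIZE = 2 (one steering and one throttle value) are assumptions, not values taken from the project. The helper build_head is a hypothetical name; 512 is the width of resnet18's pooled feature vector.

```python
import torch

OUTPUT_SIZE = 2     # assumption: one steering value and one throttle value
DROPOUT_PROB = 0.5  # assumption: the article does not state the probability


def build_head(in_features: int) -> torch.nn.Sequential:
    """Fully connected head with Dropout in front of every Linear layer."""
    return torch.nn.Sequential(
        torch.nn.Dropout(p=DROPOUT_PROB),
        torch.nn.Linear(in_features=in_features, out_features=128),
        torch.nn.Dropout(p=DROPOUT_PROB),
        torch.nn.Linear(in_features=128, out_features=64),
        torch.nn.Dropout(p=DROPOUT_PROB),
        torch.nn.Linear(in_features=64, out_features=OUTPUT_SIZE),
    )


head = build_head(in_features=512)  # resnet18 produces 512 pooled features

head.eval()  # Dropout is only active in training mode
with torch.no_grad():
    out = head(torch.zeros(1, 512))

print(tuple(out.shape))
```

Such a head could then replace the backbone's final layer (the fc attribute on a torchvision resnet18). Remember to switch between train() and eval() modes, since Dropout behaves differently in each.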


One of the goals of this part of the Jetson series is to make it capable of learning new behaviors.

In order to do that, we need to provide appropriate data.

The flow is very simple and straightforward. For example, if we want to make Jetson stop when approaching the following cones,

(source: author)

we need to annotate such images with the throttle value of ‘0.0’.

Similarly, if we want it to learn to pick the correct lane when it sees the following turn signs,

we need to provide appropriate steering values.
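As a rough illustration of what such annotations could look like in code, consider the sketch below. The Sample class, the file names, the value ranges, and every number except the throttle of 0.0 for the cones are hypothetical and only serve as an example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    image_path: str  # hypothetical file name
    steering: float  # assumed range: -1.0 (full left) to 1.0 (full right)
    throttle: float  # assumed range: 0.0 (stop) to 1.0 (full speed)


dataset = [
    # Cones ahead: the car should stop, so throttle is annotated as 0.0.
    Sample("frames/cones_001.jpg", steering=0.0, throttle=0.0),
    # Turn signs: the steering value encodes which lane to pick.
    Sample("frames/sign_left_001.jpg", steering=-0.6, throttle=0.4),
    Sample("frames/sign_right_001.jpg", steering=0.6, throttle=0.4),
]


def to_target(sample: Sample) -> list:
    """Turn an annotation into the regression target for the network."""
    return [sample.steering, sample.throttle]


for sample in dataset:
    print(sample.image_path, "->", to_target(sample))
```

Nothing in these records says "cone" or "sign"; the network only ever sees the image and the desired control values.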

What’s important here is that, in the end-to-end vision system we built, we don’t program any rules or perform any explicit cone or left/right sign detection.

However, this doesn’t mean that the system isn’t learning this implicitly by building such ‘detectors’ internally. It most likely is, and we are going to go deeper into that area in Part 3 of the series. Stay tuned!


After training our CNN model on the new datasets, we could go straight to real-world tests and make Jetson face the track. But to speed up the process of picking the right hyperparameters and model architecture, we can run some quick tests first, right after training ends.

Given a set of correctly annotated cases that are not in the training set, we can run a simple test to quickly visualize how our model would perform in those scenarios.

An example of such a test may look as follows:

(source: author)
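A numeric version of such a check could be sketched as follows. It assumes a predict callable that maps an image to a (steering, throttle) pair; the dummy predictor, scenario names, file names, and target values are placeholders rather than the project's real data.

```python
from statistics import mean


def evaluate(predict, cases):
    """Return the mean absolute error per scenario.

    cases: list of (scenario, image, (steering, throttle)) tuples.
    """
    errors = {}
    for scenario, image, target in cases:
        prediction = predict(image)
        error = mean(abs(p - t) for p, t in zip(prediction, target))
        errors.setdefault(scenario, []).append(error)
    return {scenario: mean(values) for scenario, values in errors.items()}


def dummy_predict(image):
    """Stand-in for the trained model: always predicts 'stand still'."""
    return (0.0, 0.0)


held_out_cases = [
    ("stop_at_cones", "cones_101.jpg", (0.0, 0.0)),
    ("turn_left_sign", "sign_left_101.jpg", (-0.5, 0.5)),
]

report = evaluate(dummy_predict, held_out_cases)
for scenario, error in report.items():
    print(f"{scenario}: mean absolute error = {error:.2f}")
```

Grouping the error by scenario makes it easy to spot which behavior (stopping, turning, path following) a given model or hyperparameter choice handles poorly before the car ever touches the track.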


After collecting around 80k annotated samples and spending multiple GPU hours on training and experimentation, we finally came up with a model that smoothly solves all test scenarios!

But, before we proceed to the final results, let’s see a list of some performance tips I’ve collected along the way:

  • make sure that your compute unit operates at max performance; for Jetson Nano, run sudo nvpmodel -m 0 && sudo jetson_clocks
  • don’t run the control loop from a Jupyter notebook; use a Python script to lower the latency
  • limit the number of operations performed in the control loop
  • use TensorRT for faster inference
  • set your model to FP16 or even INT8 if possible
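To check whether tips like these actually help, it is useful to measure how long one iteration of the control loop takes. Below is a minimal timing sketch; dummy_step stands in for the real work (grab a frame, run inference, send commands), and both function names are made up for this example.

```python
import time


def run_control_loop(step, iterations):
    """Run `step` repeatedly and record how long each iteration takes."""
    latencies = []
    for _ in range(iterations):
        start = time.perf_counter()
        step()
        latencies.append(time.perf_counter() - start)
    return latencies


def dummy_step():
    """Placeholder for: read camera frame, run the model, set steering/throttle."""
    time.sleep(0.001)


latencies = run_control_loop(dummy_step, iterations=20)
average_ms = 1000 * sum(latencies) / len(latencies)
worst_ms = 1000 * max(latencies)
print(f"average: {average_ms:.2f} ms, worst: {worst_ms:.2f} ms")
```

Comparing these numbers before and after a change (for example, moving from a notebook to a script) shows whether the change really lowered the latency.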

Now it’s time to see the final results!

POV path following (source: author)
stopping (source: author)

What’s Next?

Being able to teach Jetson to operate in scenarios that go far beyond simple path following is both impressive and mysterious. How did it actually learn to do these things without explicit instructions and hardcoded heuristics?

We are going to answer this question in Part 3 of the series where we are going to visualize and explore the internals of Convolutional Neural Networks. Stay tuned!

Questions? Comments? Feel free to leave your feedback in the comments section or contact me directly at

And don’t forget to 👏 if you enjoyed this article 🙂.
