Jetson - Self-Driving Toy Car (Part: 1)

Car Assembly, System Design and Basic AI Autopilot Motion 🤖🚗

In today’s article, we are going to begin a self-driving toy car series. The first part of the series will cover the car assembly and basic AI autopilot motion. We are going to start building an end-to-end vision self-driving system which will leverage Robotics, Computer Vision, and Machine Learning. By the end of this article, you will be able to assemble a self-driving toy car, make it learn how to drive, and finally let it operate fully autonomously like in the below video!

Car Assembly

To build a self-driving toy car, we need hardware that would allow us to control the car’s motion. More specifically, we need to be able to control its lateral (steering) and longitudinal (throttle) motion.

There are a bunch of options to start with. I recommend checking NVIDIA’s suggestions, the Donkey Car docs, or the Waveshare AI Kit.

I used the last option and can recommend it, as it provides an easy way to assemble the necessary hardware and control the car’s throttle with a DC motor and its steering with a servo motor.

(source: author)

Sensors

After building a car that can move, we need to make it able to sense the environment. There are multiple ways of doing that, but the most straightforward approach in the context of autonomous vehicles is to use a wide-angle front-facing camera.

(source: https://www.waveshare.com/IMX219-160-Camera.htm)

In the future, we could extend our sensor suite with sonar, lidar, or another camera, but as for now, let’s focus on a single wide-angle camera with 160° FOV.

Brain

Having our car being able to move and see, we need to bridge the gap between the hardware that we already built, and the software that we are going to build next.

In order to do that, we need a single-board computer (SBC). We are going to use NVIDIA’s Jetson Nano with a quad-core ARM CPU, 4 GB of RAM, and, most importantly, a GPU with 128 CUDA cores that allows us to run neural networks in real time.

(source: https://developer.nvidia.com/embedded/jetson-nano-developer-kit)

Track

Finally, we need a track for our car. My solution is a simple gym mat and some wooden bricks. Such an approach is a cheap, easy, and, most importantly, flexible way of building a car track. Having to build the track from scratch every time forces us to lay it out slightly differently each time, which helps prevent overfitting during the training phase. Also, with small wooden bricks, the possible track configurations are almost limitless.

(source: author)

AI Autopilot

We humans often underestimate the complexity of things that we do every day, and driving a car is one such example. Consider the simple case of going from point A to point B: for us humans, it is relatively easy. We could do it almost without paying explicit attention to the driving task itself.

Somehow, we would ‘automatically know’ that to drive we have to

  1. keep the car on the road
  2. keep the distance from other cars and obstacles
  3. look out for pedestrians
  4. look out for signs
  5. obey traffic rules
  6. go in the right direction
  7. etc.
(source: https://www.roboticsbusinessreview.com/unmanned/unmanned-ground/pbs-science-show-nova-shines-its-spotlight-on-self-driving-cars/)

These are just simplified examples of the tasks that need to be handled in order to drive safely, and a complete list would be much longer than the one above.

Now that we know that driving a car is very far from being trivial, let’s think about an approach to solve it.

Do we need to divide a driving task into smaller sub-tasks and tackle each one of them one by one, or can we just create an end-to-end system that does it all?

To better grasp the difference between these two approaches, let’s take a look at the difference between Machine Learning and Deep Learning explained in the below diagram.

(source: https://www.levity.ai/blog/difference-machine-learning-deep-learning)

In the sub-task approach (which corresponds to Machine Learning), human engineers tell the system what to look at (feature extraction); in the case of self-driving cars, that would mean looking at, for example, lanes, signs, or other cars.

However, in the end-to-end approach (which corresponds to Deep Learning), human engineers are not in the loop: they don’t tell the system explicitly what to look at, with the idea that the system will figure this out by itself, and that it will do so better than with human involvement.

In 2016, NVIDIA published a paper called End to End Learning for Self-Driving Cars, which argues that the end-to-end approach can be superior to the sub-task approach. They built an end-to-end vision system, fed it with three forward-facing cameras, trained it on human driving, and ultimately verified that the system correctly learned how to drive and stay on the road.

On the other hand, Tesla uses a hybrid system called HydraNet, which has a shared general backbone, but its task-specific sub-parts are crafted by humans.

(source: https://www.tesla.com/autopilot)

It’s described in the video below by Tesla’s Head of AI, Andrej Karpathy.

So, which approach is better?

There is no single correct answer to this question, but one of the long-term goals of this project is to study this area and eventually find this out.

We are going to start with the end-to-end approach and then, only if necessary, add specific sub-tasks that should improve our car’s autonomous driving abilities.

System Architecture

Standard system architecture in Robotics applications looks as follows:

(source: author)

Sensing

Given that our environment is a race track, we need Jetson to sense it. We already decided to use a wide-angle camera for that. An example frame from such a camera may look like the one below.

(source: author)

It’s a 224x224 RGB frame - big enough to contain the necessary data and small enough that processing it won’t take too much time.
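To make this concrete, below is a minimal sketch of how such a frame could be grabbed on the Jetson, assuming NVIDIA’s jetcam library (commonly used with CSI cameras on this platform); the actual capture code in the repo may differ.

from jetcam.csi_camera import CSICamera

# Configure the CSI camera to output 224x224 frames at 30 FPS.
camera = CSICamera(width=224, height=224, capture_fps=30)

# read() returns a single frame as a 224x224x3 uint8 numpy array.
frame = camera.read()
print(frame.shape)  # (224, 224, 3)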

But how can we make Jetson ‘understand’ such an image?

This leads us to the next phase - perception.

Perception

One of the approaches that can allow a computer to ‘make sense’ of an image is to use Convolutional Neural Networks (CNNs). If you are completely new to CNNs, please check the article below first.

In our application, we are going to use CNNs to learn the mapping between camera frames and correct steering and throttle values. In order to do it, we are going to feed the network with camera frames annotated with correct values. After some time of training, our model will be able to predict steering and throttle values given new camera frames.

(source: author)

Going back to our example camera frame, its correct annotation might be

[0.3, 0.5] #slightly right, and half-way forward

where the first value is steering and the second one is throttle.

Steering ranges from -1.0 to 1.0, from fully left to fully right.

Throttle ranges from -1.0 to 1.0, from fully backward to fully forward.
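To illustrate this input/output contract, here is a minimal sketch (the tiny linear model below is just a stand-in for the real CNN):

import torch
import torch.nn as nn

# Stand-in for the real CNN: anything that maps a 224x224 RGB frame to two numbers.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 2))

# A single frame as a batch of one: shape (1, 3, 224, 224).
frame = torch.rand(1, 3, 224, 224)

# Clamp the prediction so both values stay within the valid [-1.0, 1.0] ranges.
steering, throttle = model(frame).clamp(-1.0, 1.0)[0].tolist()
print(steering, throttle)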

Also, given that our system is going to be end-to-end by design, we are not going to include any driving-specific sub-tasks like lane finding, for example (which I have already done there, by the way). It would definitely be helpful, but the idea behind an end-to-end system is that we don’t have to tell our agent explicitly what to look for; if something is important, we hope that the agent will learn it implicitly.

Planning and Control

Being able to predict correct steering and throttle values for the camera frame that Jetson currently sees, we can make it act upon them. We are going to use Adafruit’s ServoKit library for that - an easy interface that allows us to control servos (usually for lateral movement) and motors (usually for longitudinal movement) with Python code.
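A minimal sketch of what driving the actuators with ServoKit might look like (the channel numbers and the steering-to-angle mapping below are my assumptions, not necessarily the values used by the kit):

from adafruit_servokit import ServoKit

# The servo driver board exposes 16 PWM channels.
kit = ServoKit(channels=16)

STEERING_CHANNEL = 0  # assumed channel of the steering servo
THROTTLE_CHANNEL = 1  # assumed channel of the drive motor (as a continuous servo)

def drive(steering, throttle):
    # Map steering from [-1.0, 1.0] to a servo angle in [0, 180] degrees.
    kit.servo[STEERING_CHANNEL].angle = (steering + 1.0) * 90.0
    # Continuous servos accept throttle directly in [-1.0, 1.0].
    kit.continuous_servo[THROTTLE_CHANNEL].throttle = throttle

drive(0.3, 0.5)  # slightly right, half-way forward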

Actuation

Finally, by putting the servos and motors in motion, we make our Jetson move on the race track!

Learning How to Drive

Having our system already designed, we can proceed to implementing it and allowing Jetson to learn how to drive.

The full codebase is available in the project repository - feel free to check it and follow along.

1. Data Collection

Code

Our data collection pipeline is based on the concept of Behavioral Cloning, a method in which a human’s cognitive skills are captured and fed into the learning system so that they can later be reproduced.

(source: https://link.springer.com/referenceworkentry/10.1007%2F978-0-387-30164-8_69)

With the autopilot_data_collection notebook, we can drive the car with a gamepad and, while doing so, record camera frames with the corresponding steering and throttle values. Such ‘correct human-level’ driving data is collected into a dataset.

POV data collection at 10 FPS (source: author)
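A rough sketch of what one recording step could look like (the file-naming scheme and the gamepad helper below are hypothetical, not necessarily what the notebook does):

import os
import uuid
import cv2

def record_sample(image, steering, throttle, directory="dataset"):
    # Encode the labels in the filename, e.g. "0.30_0.50_<uuid>.jpg",
    # so every saved frame carries its own [steering, throttle] annotation.
    os.makedirs(directory, exist_ok=True)
    name = f"{steering:.2f}_{throttle:.2f}_{uuid.uuid4().hex}.jpg"
    cv2.imwrite(os.path.join(directory, name), image)

# Inside the recording loop: frame from the camera, values from the gamepad.
# record_sample(camera.read(), gamepad.steering, gamepad.throttle)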

2. Training

Code

Having a dataset of human-level driving, we can proceed to train a CNN on it. Our initial model will be very simple.

camera_frame -> model -> [steering, throttle]

We’ll use a ResNet18 architecture pretrained on ImageNet, which is proven to yield good results with a relatively small number of parameters. We’ll pick Adam as the optimizer and MSE as the loss function.

(source: author)
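A minimal PyTorch sketch of this setup (the learning rate and other details are my own choices, not necessarily those used in the repo):

import torch
import torch.nn as nn
import torchvision

# ResNet18 pretrained on ImageNet, with the classification head replaced
# by a 2-unit regression head for [steering, throttle].
model = torchvision.models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 2)
model = model.cuda()

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

def train_step(images, targets):
    # images: (N, 3, 224, 224) tensor, targets: (N, 2) [steering, throttle] tensor.
    optimizer.zero_grad()
    loss = criterion(model(images.cuda()), targets.cuda())
    loss.backward()
    optimizer.step()
    return loss.item()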

It’s not SOTA, but it’s relatively fast, especially after converting it to TensorRT for faster inference.
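The conversion can be done with NVIDIA’s torch2trt library, roughly like this (a sketch; the fp16 flag and file name are my own choices):

import torch
import torch.nn as nn
import torchvision
from torch2trt import torch2trt

# The trained model (rebuilt here for completeness) in inference mode on the GPU.
model = torchvision.models.resnet18()
model.fc = nn.Linear(model.fc.in_features, 2)
model = model.cuda().eval()

# torch2trt traces the model with an example input and builds a TensorRT engine.
example_input = torch.rand(1, 3, 224, 224).cuda()
model_trt = torch2trt(model, [example_input], fp16_mode=True)

torch.save(model_trt.state_dict(), "autopilot_trt.pth")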

It’s possible to train on the Jetson Nano, but I recommend offloading the data to a GPU server for faster training.

Our model made about 95% of its progress in the initial training epoch; from that point, it was mostly fine-tuning. I recommend saving the model with the lowest validation loss to get the one that generalizes best. In our case, that was at epoch 42.

(source: author)
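In code, keeping only the best-generalizing checkpoint boils down to a pattern like this (a generic sketch with stand-in values, not the repo’s exact training loop):

import torch
import torch.nn as nn

model = nn.Linear(10, 2)                          # stand-in for the real CNN
validation_losses = [0.9, 0.4, 0.06, 0.07, 0.05]  # pretend per-epoch validation losses

best_val_loss = float("inf")
for epoch, val_loss in enumerate(validation_losses):
    # Save a checkpoint only when the validation loss improves.
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        torch.save(model.state_dict(), "best_autopilot.pth")
        print(f"saved checkpoint at epoch {epoch} (val loss {val_loss:.2f})")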

3. Testing

Code

Finally, having a well-trained model, we can proceed to the fun part - testing Jetson on track!

We’ll use a simple infinite while loop for that, where we:

  1. grab a camera frame
  2. pass it to the model
  3. apply the model’s output to the car controller
  4. repeat
import time

while True:
    start_time = time.time()

    # Grab a camera frame, preprocess it, and run the TensorRT-optimized model.
    image = camera.read()
    image = preprocess(image)
    output = model_trt(image).detach().clamp(-1, 1).cpu().numpy().flatten()

    # Apply the model's predictions to the car controller.
    steering = float(output[0])
    throttle = float(output[1])
    car.steering = steering
    car.throttle = throttle

    fps = 1 / (time.time() - start_time)
    print("fps: " + str(fps) + ", steering: " + str(steering) + ", throttle: " + str(throttle), end="\r")

Such a pipeline runs at ~30 FPS, which is crucial because we cannot afford to be late on the track - lag may result in crashes or going out of bounds.

I tried bigger networks like ResNet34 and ResNet50, and even though they were giving better predictions, they were significantly slower (around 20 FPS and 15 FPS respectively). The lesson learned is that it’s better to have slightly worse predictions at a faster rate, so that the system can recover from them instead of crashing.

Ultimately, Jetson successfully learned to autonomously drive our track in both directions!

What’s Next?

We showed that an end-to-end vision system can work in a simple self-driving application - it gave Jetson the ability to learn how to drive on a simple race track.

Now we can push the boundaries even further and build more sophisticated tracks. Check out Part 2 of the series, where we’ll improve the system and train in new scenarios to allow Jetson to handle more complicated situations!

Questions? Comments? Feel free to leave your feedback in the comments section or contact me directly at https://gsurma.github.io.

And don’t forget to 👏 if you enjoyed this article 🙂.