I thought it was telling that Andrej immediately "reframed" the question because Lex asked the "wrong question". This is a classic evasion technique one learns from experience and/or media training. Lex's comment immediately after was a clever and gentle dig at Andrej's response.
It seemed like all the "full cost" negatives Andrej mentioned were related to Tesla's ability to execute, and not what would actually produce better results. Tesla would have to be able to reliably procure parts, write reliable firmware, create designs and processes that won't increase unexpected assembly line stops, etc.
Regarding results, the best Andrej can do is, "In this case, we looked at using it and not using it, and the delta was not massive." In other words, the results are better, but not enough to make up for the fact that Tesla can't support additional sensors without incurring a prohibitive amount of additional risk to Tesla. Risk to passengers doesn't appear to be a consideration.
Q: "Does [removing some sensors] make the perception problem harder, or easier?"
(note, this is literally what Lex asked, your restatement is misleading)
A: [paraphrasing] "Well more sensor diversity makes it harder to focus on the thing that I believe really moves the needle, so by narrowing the space of consideration, I think we'll get better results"
Karpathy might not be telling the truth, I don't know. But it's a much more credible pitch than you make it sound, because it's often true that you can deliver better by focusing on a smaller number of things. Engineering has always been about tradeoffs. Nobody is offering Karpathy infinite money plus infinite resources plus infinite time to do the job.
Again, I'm not saying Karpathy is honest or correct. I'm saying that the rephrasings in this comment and this thread are hilariously unfair.
It is definitely a clever marketing pitch, as there is plenty of evidence to back up that LIDAR makes self-driving cars significantly safer. However, despite the hype, Teslas aren't really self driving cars at the moment, so it seems an acceptable commercial decision wrapped up in a clever sales pitch.
That's also true for high resolution maps. The question is whether you're solving for self-driving on highways or a handful of mapped city centers or whether you want to solve for the real thing. Tesla is all-in on FULL self driving, and most other companies are betting on driver assistance or gps-locked self-driving. If Tesla can get FSD to work in the next couple of years then they're vindicated. If FSD requires a weak form of generalized intelligence (plausible) then FSD isn't happening anytime soon and investing in more sensors and GPS maps is correct.
High resolution maps do not give you an accurate 3D representation of nearby objects.
Our brains do an amazing job interpreting high resolution visual data and analyzing it both spatially and temporally. Our brains then take that first analysis and apply a secondary, experiential, analysis to further interpret it into various categories relevant to the current activity.
What I’ve seen from Tesla so far indicates to me that FSD shouldn’t be enabled regardless of what sensor package they’re using, let alone based on camera data only. They need to solve their ability to accurately observe their surroundings first, especially temporally. Things shouldn’t be flashing in and out that have been clearly visible to the human eye the entire time. Additionally, this all ignores the experiential portion of driving. When most people approach something like a blind driveway or crosswalk obscured by a vehicle (a dynamic, unmapped, situation), they pay special attention and sometimes change their driving behavior.
I think they’re talking about number of different systems doing the same thing. Have one system doing it that is sufficiently abstracted away from a common set of hardware vs various systems competing for various aspects of control.
Sorry, it's your opinion that researchers and/or engineers working on DL or Bayesian methods work better when they're distracted by many diverse tasks? What?
No, it's my opinion that in linear regression an inordinate amount of time is spent on feature selection and ensuring there are no correlations among the features. When data is cheap in both X and Y, winnowing down X is a lot of work.
Munro’s cost breakdown is much more informative in just how much it’ll save in terms of parts/labor. https://youtu.be/LS3Vk0NPFDE
In general the ‘harm to consumers’ is really just making it more likely they damage the car in a parking lot or their garage, which tells you where their priorities are (sales, Automotive gross profit). Assuming the occupancy network works, the only real blind spot left is if something in front of the car changes between it turning off and on (assuming the occupancy network will 'remember' the map around it when it goes to sleep).
Also, Tesla’s strategy for safety is seemingly “excel in industry standard tests, ie. IIHS and EuroNCAP”, so this might be a case of the measure becoming a target.
This thread is unhelpfully mixing radar and ultrasonic sensors. Ultrasonic sensors, as your video explains, are primarily used as a parking aid; they are tuned for too low a distance to be helpful in just about any kind of driving scenario at speed.
Meanwhile, radar is the principal sensor used in systems like automatic emergency braking across the industry. It has no intersection with any of the parking stuff because it generally has to ignore stationary objects to be useful (hence the whole "Teslas crashing full speed into stopped vehicles" thing).
> ”The first famous autopilot crash was because a white semi-truck was washed out by the sun and confused for an overhead sign.
That's literally trivial for a car with radar to detect.”
That crash occurred on a car which was using radar. Automotive radar generally doesn’t help to detect stationary objects.
Further, that crash occurred on a vehicle with the original autopilot version (AP1), which was based on Mobileye technology with Tesla’s autopilot software layered on top. Detection capabilities would have been similar to any vehicle using Mobileye for AEB at the time.
I find very strange the claim that a moving doppler (pulsed doppler?) radar 'generally doesn't help to detect stationary objects'. I mean if the car is moving, it generates a doppler shift on all objects moving at a different speed, right?
Maybe it's difficult for reasons of false alarm detection (too many stationary objects that are not of interest) but you can get very good results with tracking (curious about these radars' refresh rate), STAP, and classification/identification algorithms, especially if you have a somewhat modern beamformed signal (so, some kind of instant spatial information). Active tracking can also help here if you can beamsteer (put more energy and more waveform diversity on the target, increase the refresh rate). Can't these radars do any of that 'state of the art 20 years ago' stuff?
There's something I don't get here and I feel I need some education...
Source: have worked with some of the (admittedly last-gen) automotive RADAR chips, NXP in particular.
The issue is the number of false positives; stationary objects need to be filtered out. Something like a drainage grille on the street generates extremely strong returns. RADAR isn't high enough resolution to differentiate the size of something: you only have ~10 degrees of angular resolution, and after that you need to go by the strength of the returned signal. So there's no way to differentiate a bridge girder or a railing or a handful of loose change on the road from a stationary vehicle. On the other hand, if you have a moving object, RADAR is really good at identifying it and doing adaptive cruise control etc.
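A rough sketch of the kind of filtering described above; the data structure, field names and thresholds are invented for illustration and are not any vendor's actual pipeline.

    # Hypothetical sketch of the stationary-target filtering described above.
    # Field names and thresholds are invented for illustration only.
    from dataclasses import dataclass

    @dataclass
    class RadarReturn:
        range_m: float            # distance to the reflector
        closing_speed_mps: float  # Doppler-measured speed toward the radar (positive = approaching)
        rcs_dbsm: float           # strength of the return (mentioned above; not used in this simple filter)

    def keep_moving_targets(returns, ego_speed_mps, tol_mps=1.0):
        """Discard returns from objects that are stationary in the world.

        A world-stationary reflector appears to approach a forward-moving car
        at roughly the ego speed, so closing_speed ~ ego_speed flags it as
        stationary.  With ~10 degrees of angular resolution there is no way
        to tell a bridge girder from a stopped truck by geometry, which is
        why these returns get dropped instead of classified.
        """
        moving = []
        for r in returns:
            ground_speed = r.closing_speed_mps - ego_speed_mps  # ~0 if stationary in the world
            if abs(ground_speed) > tol_mps:
                moving.append(r)   # e.g. a lead vehicle for adaptive cruise
            # else: could be a drain grate, a sign gantry, or a stopped car
        return moving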
RADAR can have high(er) angular resolution with (e.g.) phased arrays (linear or not) and digital beamforming. I guess it's the way the industry works and it wants small cheap composable parts, but using the full width of the car for a sensor array you could get amazing angular accuracy, even with cheap simple antennas. MIMO is also supposed to give somewhat better angular accuracy, since you can perform actual monopulse angular measurement (as if you had several independent antennas). There's even recent work on instant angular speed measurement through interferometry if you have the original signals from your array.
And with the wavelengths used in car RADARs you could get very fine range resolution, especially with the recent progress on ADCs and antenna tech.
I'm not saying you're wrong, you're describing what's available today (thanks for that).
Wondering when all this (not so new) tech might trickle down to the automotive industry... And whether there's interest (looking at big fancy manufacturers forgoing radar isn't encouraging there).
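As a rough sanity check on the 'full width of the car' idea above, here's a back-of-the-envelope beamwidth calculation; the aperture sizes are illustrative, not any product's spec.

    import math

    # Diffraction-limited beamwidth of an aperture: theta ~ wavelength / aperture.
    # Numbers are illustrative only.
    c = 3.0e8                   # speed of light, m/s
    freq_hz = 77e9              # common automotive radar band
    wavelength_m = c / freq_hz  # ~3.9 mm

    for aperture_m in (0.08, 1.8):  # roughly a single module today vs. the full car width
        beamwidth_deg = math.degrees(wavelength_m / aperture_m)
        print(f"{aperture_m:4.2f} m aperture -> ~{beamwidth_deg:.2f} deg beamwidth")
    # 0.08 m -> ~2.8 deg, 1.8 m -> ~0.12 deg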
In theory a big phased array of cheap antennas is cheap; in practice it isn't, because you need equal-impedance routing to all of the antennas, which means you need them all to be roughly equidistant to the amplifier. You could probably get away with blowing it up to the size of a large dinner plate, but then you also need a super stiff substrate to avoid flexing, and you need to convince manufacturers that they should make space for this in their design language without any metallic paint or chromed elements in front of it.
Which car brand do you think would take up these restrictions, and which customer is then going to buy the car with the big ugly patch on the front?
Modern phased arrays can have independent transmitters (synchronized digitally or with digital signal distribution) or you can have one 'cheap and stupid' transmitter and many receivers, doing rx beamforming, and as for complexity you mostly 'just' need to synchronize them (precisely). The receivers can then be made on the very cheap and you need some signal distribution for a central signal processor.
Non-linear or sparse arrays are also now doable (if a bit tricky to calibrate) and remove the need for complete array or rigid substrate or structure.
If you imagine the car as a multistatic many-small-antennas system there's lots that could be done. Exploding the RADAR 'box' into its parts might make it all far more interesting.
I'll admit I'm way over my head on the industrial aspects, so thanks for the reality check. Just enthusiastic, the underlying radar tech has really matured but it's not easy to use if you still think of the radar as one box.
I know even for the small patch antennas we were looking at, the design of the waveguides was insanely complicated. I can't imagine blowing it up to something larger with many more elements.
If you wanted separated components to group together many antennas, I suspect the difficulty would be accurate clock synchronization, given automotive standards for wiring. I'm still not sure I understand how they can get away without rigid structures for the antennas, but this would be a critical requirement because automotive frames flex during normal operation.
Cars are also quite noisy RF environments due to spark plugs.
I guess what you're speaking of will be the next 10-20 years of progress for RADAR systems as the engineering problems get chipped away at one at a time.
There's also a legitimate harm to consumers with such a large radar array in the front bumper. Because even a minor fender bender could total a $50k car.
So the car would be very difficult to sell since few people are willing to pay much higher insurance premiums just for that.
I've heard people on the internet claim that, in automotive radar the first thing they do when processing the signal is discard any stationary objects. Apparently this is because the vast majority of the time it's a sign or overhead gantry or guard rail - any of which could plausibly be very close to the lane of travel thousands of times per journey - and radar doesn't provide enough angular resolution to tell the difference.
Personally I've never seen these claims come from the mouth of an automotive radar expert, and many cars do use radar in their adaptive cruise control, so I present it as a rumour, not a fact :)
Indeed, my VW which uses a forward looking radar has signaled several times for stationary objects. In fact, the one time it literally stopped an accident was for a highway that suddenly turned into a parking lot. People keep repeating BS said by tesla and tesla apologists for why their cars run into stopped things and others seem to have less of a problem with it.
> I find very strange the claim that a moving doppler (pulsed doppler?) radar 'generally doesn't help to detect stationary objects'. I mean if the car is moving, it generates a doppler shift on all objects moving at a different speed, right?
I’m in the same boat as to not understanding why, but from what I have read the problem indeed isn’t that it doesn’t detect them, it’s that there are too many of them, and nobody has figured out how to filter out the 99+% of signals you have to ignore from the ones that may pose a risk, if it’s doable at all.
I think that at least part of the reason is that the spatial resolution of radar isn’t great, making it hard to discriminate between stationary objects in your path and those close to it (parked cars, traffic signs, etc). Also, some small objects in your path that should be ignored, such as soda cans with just the ‘right’ orientation, can have large radar reflections.
Especially when most car radars are FMCW radars. They not only know the speed, they also know the distance.
Some of the newest car radars can do some beamforming, but not all.
Most models have multiple radars pointing in multiple directions as that's cheaper than AESA.
Only just recently have "affordable" beamformers come to the market. And those target 5G basestations.
So the spec in most K/Ka-band models starts at 24.250GHz, where the 5G band starts.
While the licence free 24GHz band that the radars use is 24.000-24.250GHz.
If this was not bad enough there has been consistent push from regulators to get the car radars on the less congested 77GHz band.
And there are even fewer affordable beamformers for that band.
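For reference, the basic FMCW relationships behind knowing both speed and distance; the chirp parameters below are illustrative, not taken from any real sensor's datasheet.

    # Illustrative FMCW chirp parameters (not a real sensor's datasheet).
    c = 3.0e8
    f_carrier_hz = 77e9         # carrier frequency
    bandwidth_hz = 1e9          # frequency swept during one chirp
    chirp_time_s = 50e-6        # duration of one chirp

    wavelength_m = c / f_carrier_hz

    # Range: the beat frequency between transmitted and received chirps is
    # proportional to distance; the sweep bandwidth sets the range resolution.
    range_resolution_m = c / (2 * bandwidth_hz)            # 0.15 m with a 1 GHz sweep

    # Velocity: the phase rotation of the beat signal from chirp to chirp gives
    # radial speed; the chirp repetition time sets the unambiguous window.
    max_unambiguous_speed_mps = wavelength_m / (4 * chirp_time_s)  # ~19.5 m/s here

    print(f"range resolution      : {range_resolution_m:.2f} m")
    print(f"unambiguous |velocity|: {max_unambiguous_speed_mps:.1f} m/s")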
Might be time for some state sponsorship to get beamforming ASICs and FPGA designs for these bands. Although I might be missing something: once you're back down at your demodulated sampling frequency, shouldn't your old beamformer suffice? Or are we talking an 'ADC + demodulator + filter + beamforming' ASIC?
Not a fan of Tesla removing the sensors but a vehicle on a highway that isn’t moving the same direction as the car is not “trivial” with radar. No AEBs that use radar look for completely stopped objects after a certain speed because the number of false positives is so high.
So, yes, cars that are programmed to have AEB: perform well at AEB and not other tasks. We are in agreement here. (I even agree with you that those cars use Radar for AEB).
Now, where we disagree is you implying that cars with AEB-level radar (literally $10 off-the-shelf parts with whatever sensor fusion some MobilEye intern dreams up) are somehow the same as self-driving cars (the goal of Tesla Autopilot).
Every serious self-driving car/tractor-trailer out there uses radar as a component of its sensor stack because Lidar and simple imaging is not sufficient.
And that's the point I was trying to make: we agree it's trivial for radar to find things; you just need sensor fusion to confirm the finding and begin motion planning. This is why a real driverless car is hard despite what Elon would like you to believe. There is no one sensor that will do it. Full stop.
And this cuts to the core of why Tesla is so dangerous. They are making a car with AEB and lane-keeping and moving the goal posts to make people (you included) think that's somehow a sane approach to driverless cars.
> Yet somehow, humans can drive cars with just a pair of optical sensors
A pair of optical sensors and a compute engine vastly superior to anything that we will have in the near future for self-driving cars.
Humans can do fine with driving on just a couple of cameras because we have an excellent mental model (at least when not distracted, tired, drunk, etc.). Cars won't have that solid of a mental model for a long, long time, so sensor superiority is a way to compensate for that.
The optical sensors are just a small part of the human (and animal in general) vision system. A much bigger component is our innate (evolutionarily acquired) understanding of basic mechanics, simple agent theory, and object recognition.
When we look at the road, we recognize stuff in the images we get as objects, and then most of the work is done by us applying basic logic in terms of those objects - that car is off the side of the road so it's stationary; that color change is due to a police light, not a change in the composition of objects; that small blob is a normal-size far-away car, not a small and near car; that thing on the road is a shadow, not a car, since I can tell that the overpass is casting it and it aligns with other shadows.
All of these things are not relying on optics for interpreting the received image (though effects such as parallax do play a role as well, it is actually quite minimal), they are interpreting the image at a slightly higher level of abstraction by applying some assumptions and heuristics that evolution has "found".
Without these assumptions, there simply isn't enough information in an image, even with the best possible camera, to interpret the needed details.
> "A much bigger component is our innate (evolutionarily acquired) understanding of basic mechanics, simple agent theory, and object recognition. ... they are interpreting the image at a slightly higher level of abstraction by applying some assumptions and heuristics that evolution has "found"."
Of course, and all this is exactly what self-driving AIs are attempting to implement. Things like object recognition and understanding basic physics are already well-solved problems. Higher-level problem-solving and reasoning about / predicting behaviour of the objects you can see is harder, but (presumably) AI will get there some day.
Putting all of these together amounts to building AGI. While I do believe that we will have that one day, I have a very hard time imagining as the quickest path to self-driving.
Basically my contention is that vision-only is being touted as the more focused path to self-driving, when in fact vision-only clearly requires at least a big portion of an AGI. I think it's pretty clear this means vision-only is not currently a realistic path to self-driving, while other paths using more specialized sensors seem more likely to bear fruit in the near term.
And Tesla lacks that, so they ought not rely on cameras alone; they ought to use extra auxiliary systems to avoid endangering their customers. They are not doing this because it reduces their profit margins. Alas, this HN thread.
> Yet somehow, humans can drive cars with just a pair of optical sensors (mounted on a swivelling gimbal, of sorts).
In fairness, humans have a lot more than just optical sensors at their disposal, and are pretty terrible drivers. We've added all kinds of safety features to cars and roads to try to compensate for their weaknesses, and it certainly helps, but they still make mistakes with alarming regularity, and they crash all the time.
When you have a human driver, conversations about safety and sensor information seem so straightforward. The idea of a car maker saving a buck by foregoing some tool or technology at the expense of safety is largely a non-starter.
What's weird is, with a computer driver, (which has unique advantages and disadvantages as compared to a human driver) the conversation is somehow entirely different.
> We've added all kinds of safety features to cars and roads to try to compensate for their weaknesses
This is a super important point. Whenever self-driving cars comes up in conversation it's like, "we're spending billions of dollars on self-driving cars tech, but what if we just, idk, had rails instead of roads". We're putting all the complexity on the self-driving tech, but it seems pretty clear that if we helped a little on the other end (made driving easier for computers), everything would get better a lot faster.
> In theory, a sufficiently capable AI should be able to drive a car at least as well as a human can using the same input: vision.
In theory, cars should use mechanical legs instead of wheels for transportation; that's how animals do it. In theory, plane wings should flap around; that's the way birds do it. My point being: the way biology solved something may not always be the best way to do it with technology.
> ”In theory, cars should be use mechanical legs instead of wheels for transportation, that's how animals do it.”
Wheels and legs solve different problems. Wheels aren’t very useful without perfectly smooth surfaces to run them on. If roads were a natural phenomenon that had existed millions of years ago, then isn’t it plausible that some animals might have evolved wheels to move around faster and more efficiently?
GP was stating that "two cameras mounted 15cm apart on a swivel slightly left of the vehicle center of geometry" has proven to be a _sufficient_ solution, not necessarily the best solution.
>Yet somehow, humans can drive cars with just a pair of optical sensors (mounted on a swivelling gimbal, of sorts).
This is wrong and I was surprised to hear them say it was enough in the video.
Car horns and sirens aren't made for your eyes. You will often hear something long before you see it. This is important for emergency vehicles. Once you hear it, a good driver will immediately slow down and pull to the side, or delay movement to give space for the vehicle.
Does this mean self driving vehicles can't detect emergency vehicles until they appear on camera? That's not encouraging.
>Once you hear it, a good driver will immediately slow down and pull to the side, or delay movement to give space for the vehicle.
Robotically performing an action in response to single/few stimuli with little consideration for the rest of the setting and whether other responses could yield more optimal results precludes one from ever being a "good" driver IMO.
"See lights, pull over" is not going to cut it. See any low effort "idiot drivers and emergency vehicles" type youtube compilation for examples of why these sorts of approaches fall short.
That might have something to do with the general intelligence prediction supercomputer sitting between the ears. If Tesla is saying they won't have real self driving (not just an 80 percent solution that they then lie and say is complete) until they develop an AGI, I agree
Optical sensors, plus an innate model of how the world around them works that lets them preview what it will do next.
And most importantly, a social understanding of what other humans around them are likely to do.
Our two eyeballs plus brain is SO MUCH MORE than just two mediocre CCDs.
Our eyes provide distance sensing through focusing, the difference in angle of your two eyes looking at a distant object, and other inputs, as well as having incredible range of sensitivity, including a special high contrast mode just for night driving. This incredibly, literally unmatched camera subsystem is then fed into the single best future prediction machine that has ever existed. This machine has a powerful understanding of what things are (classification) and how the world works (simulation) and even physics. This system works to predict and respond to future, currently unseen dangers, and also pick out fast moving objects.
Two off the shelf digital image sensors WILL NEVER REPLACE ALL OF THAT. There's literally not enough input. Binocular "vision" with shitty digital image sensors is not enough.
Humans are stupidly good at driving. Pretty much the only serious accidents nowadays are ones where people turn off some of their sensors (look away from the road at something else, or drugs and alcohol) or turn off their brain (distractions, drugs and alcohol, and sleeping at the wheel).
Yes, a "pair" of optical sensors. Tesla is at a disadvantage compared to humans -- they do not do stereoscopic imaging, which makes distance of objects less reliable -- they try to infer distance from a single flat image. Humans having two sensors pointed in the same direction gives us a very reliable way of determining distance (up to a relevant distance for driving at least).
Interestingly, even people with missing stereoscopic vision are allowed to drive. We don't require depth perception to drive. The assumption is that they can compensate.
Binocular vision isn't even the only source of depth information available to humans. That's why someone missing an eye can still make reasonable depth estimations.
Isn't this a bit like saying we can do better than fixed-wing aircraft, because birds can flap their wings? With sufficiently advanced material science, flapping-wing human flight too, is possible. But that doesn't mean Boeing and Cessna are misguided.
But that's not how people drive. They use their ears, they move their head around to generate parallax, they read the body-language of other drivers, they make eye-contact at intersections, they shift position to look around pillars, or stoop to see an inconveniently placed stop light. Fixed forward cameras do none of that.
But if the radar just sees a static object and can't tell if it's an overhead sign or a car, and the camera vision is too washed out, how would sensor fusion help in your example?
Perhaps stop cheaping out on the cameras and procure ones with high dynamic range. Then again those may be "expensive and complicate the supply chain for a small delta"
A human driver slows down and moved their head around to get a better view when the glare from the sun is too strong to see well. I’d expect a self driving car to similarly compromise on speed for the sake of safety, when presented with uncertainty.
Lidar would make it pretty obvious whether it's a sign or a car, even if the camera didn't tell you. The part where the lidar doesn't bounce back at vehicle level would be a dead give away.
That's literally trivial for a car with radar to detect.
In principle that is correct… but radars in automotive applications are unable (or rather not used) to detect non-moving targets?
Asking this because I know first hand that the adaptive cruise function in my car must have a moving vehicle in front of it for the adaptive aspect to work. It will not detect a vehicle that is already stopped.
The resolution of the radar is pretty good though; even if the vehicle in front is just merely creeping off the brakes… it does get detected if it is at or more than the “cruising distance” set up initially.
My understanding is that your typical automotive radar will have insufficient angular resolution to reliably distinguish, say, an overpass from a semi blocking the road, or a pedestrian standing in the middle of the road from one on the footpath.
Radar does however have the advantage of measuring object speed directly via the doppler effect, so you can filter out all stationary objects reliably, then assume that all moving objects are on the road in front of you and need to be reacted/responded to.
So I think it's the case that radar can detect stationary objects easily, but cannot determine their position enough to be useful, hence in practice stationary objects are ignored.
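To put rough numbers on "insufficient angular resolution" (the 10 degree beamwidth is illustrative; real sensors vary widely):

    import math

    # Cross-range extent of a single radar beam at distance R is roughly R * beamwidth.
    beamwidth_deg = 10.0          # illustrative; real sensors vary widely
    for distance_m in (30, 100, 150):
        cross_range_m = distance_m * math.radians(beamwidth_deg)
        print(f"at {distance_m:3d} m: one beam spans ~{cross_range_m:.0f} m across")
    # ~5 m at 30 m, ~17 m at 100 m: an overpass, a roadside sign, and a stopped
    # truck in your lane can all land in the same resolution cell.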
Adaptive cruise control is solving a totally different problem. It is specifically looking for moving objects to match pace with. That's very different from autonomous driving systems.
Radar is quite good at finding stationary metal objects, particularly. Putting it in a car, if anything, helps, because the stationary objects are more likely to be moving relative to the car...
The kicker for me is that the area covered by the ultrasonic sensors is essentially all blind spots for the cameras. The sensors currently are able to tell you when something too low to see is getting within a few inches of the car. It also gives an exact distance when parking, so I can know that I'm parking exactly 2ft from the wall every time. As much as they claim otherwise, it simply cannot be a matter of fixing it in software. The cameras can't tell you what they can't see. They simply don't have the coverage to do this, and clearly don't even have the coverage to hit parity with radar enabled autopilot either.
The sensors are unreliable and expensive in terms of R&D. Having marginal parts which take money from a finite R&D budget can easily result in a worse product. “They contribute noise and entropy into everything.” … “you’re investing fully into that [vision] and you can make that extremely good. You only have a finite amount of spend of focus across different facets of the system.”
His standpoint can be summed up as “I think some of the other companies are going to drop it.” Which would be really interesting if true.
> Having marginal parts which take money from a finite R&D budget can easily result in a worse product.
"Less sensors can be more safe/effective if that allows us to focus on making effective use of the sensor information we do have, which is the result we're aiming for with this descision." would be a reasonable answer (if true), but that doesn't seem like a fair interpretation of what he actually said.
It’s fairly rambling but he touches on that exact point several times, most specifically here at the 2 minute mark:
“Organizationally it can be very distracting. If all you want to get to work is vision, resources are on it and you’re actually making forward progress. That is the sensor with the most bandwidth, the most constraints, and you’re investing fully into that and you can make that extremely good. You only have a finite amount of spend of focus across different facets of the system.”
Which was from this section: Q: “Is it more bloat in the data engine?”
“100%” (Q:“is it a distraction?”) “These sensors can change over time.” “Suddenly you need to worry about it. And they will have different distributions. They contribute noise and entropy into everything and they bloat stuff”.
Even earlier he says:
“These sensors aren’t free…” list of reasons including “you have to fuse them into the system in some way. So that like bloats the organization” “The cost is high and you’re not particularly seeing it if you’re just a computer vision engineer and I am just trying to improve my network.”
> Isn’t that his point though? Where exactly is he not saying that?
Andrej eventually gets to it. But his first response was to evade. Lex is a skilled interviewer. By not letting him wriggle out of a difficult question we eventually got a substantive answer. But Andrej's first instinct was to evade. That's notable.
I don't agree Lex is a skilled interviewer, he's great at creating interesting conversations in the aw-shucks way Joe Rogan is, but he mostly plays a fanboy role. I still love a Lex interview.
Didn't they used to talk about how the Tesla radar could actually see the reflections of the car ahead of the one just in front of you? i.e. the radar reflection bouncing underneath the car just in front of you?
This is what doesn't add up to me. Either a lot of that previous wonder-talk was actually a lie, or there's something else going on here.
It is a hard question to answer. It’s like asking if more programmers on a project will allow it to be completed faster with higher quality. Ya, theoretically they could, in practice not likely. More sensors are like more programmers, theoretically they can be safer and more effective, but in practice they won’t be. Sensor fusion is as hard a problem as scaling up a software team.
LIDAR can be safer than an optical system, I can believe that. LIDAR and an optical system being safer than either alone without a lot of extra complexity: maybe not.
That isn't it though. It isn't like pumping a baby out in 1 month using 9 women. No, the problem is the fusion of too much information that varies substantially. They have completely different views of the world and you can't just lerp them together.
I bring up the programmers working on a project example just to illustrate how more isn't always better even if it theoretically can be.
I mean yeah, but it is a friggin ton heavy object moving at high speed controlled by a computer. Having another kind of sensor system to cross-check might be the reasonable thing to have, even if you happen to make it work well in 99% of the cases just with optics — the issue is that the other 1% kill people.
Your optical system can be good as heck until a bug hits it directly on the lens, covering an important frontal area, and makes it behave weirdly.
In your metaphor it's like asking if you should have project managers as well as engineers on your project. And Tesla has decided that having only engineers allows them to focus on having the best engineers. And they avoid the distraction of having to manage different types of employees.
Different sensors are even worse for sensor fusion. Actually, fusion only applies to different sensors: incorporating different signals with different strengths and weaknesses into a model that is actually better, not worse, is difficult.
Lack of focus is a major problem for companies and we all know that tech debt leads to increased bug counts.
Team focus on vision which is by far the highest accuracy and bandwidth sensor allows for a faster rate of safety innovation given a constant team size.
Tesla's cameras often get blocked by rain, blinded by the sun, or can't see that well in the dark. It's really hard to imagine those cameras replacing the ultrasonic sensors, which do a pretty good job at telling you where you are when you're parking etc. I can't see how the camera is going to detect an object in pitch dark and estimate the distance to it better than an ultrasonic sensor. But hey, if people ding their cars it's more revenue.
The bottom line seems to be part shortages that would have slowed production, plus cost cutting. The rest of the story seems like a fable to me. It was pretty clear Tesla removed the radar because it couldn't get enough radars.
The interview didn't really impress me. I'm sure Andrej is bound by NDA and not wanting to sour his relationship with Tesla/Elon but a lot of the answers were weak. (On Tesla and some of the other topics, like AGI).
One interesting side effect of only using visual sensors is that the failure modes will be more likely to resemble human ones. So people will say "yeah, I would have crashed in that situation too!". With ultrasonic and radar and lidar it may make far fewer mistakes, but it is possible they might not be the same ones people make, so people will say "how did it mess that up?"
Sadly, that’s the worst way to actually design the system. I’d rather have two different technologies working together, with different failure modes. Not using radar (especially in cars that are already equipped) might make economic sense to Tesla, but I’d feel safer if visual processing was used WITH radar as opposed to instead of radar.
I also expect an automated system to be better than the poor human in the drivers seat.
You have to eventually decide to trust one or the other, in real-time. So having multiple failure modes doesn't solve the problem entirely. This is called 'Fusion', meaning you have to fuse information coming from multiple sensors together. There are trade offs because while you gain different views of the environment from different sensors, the fusion becomes more complicated and has to be sorted out in software reliably in real-time.
> There are trade offs because while you gain different views of the environment from different sensors, the fusion becomes more complicated and has to be sorted out in software reliably in real-time.
If you're against having multiple sensors though, the rational conclusion would be to just have one sensor, but Tesla would be the first to tell you that one of the advantages their cars have over human drivers is they have multiple cameras looking at the scene already.
You already have a sensor fusion problem. Certainly more sensors add some complexity to the problem. However, if you have one sensor that is uncertain about what it is seeing, having multiple other sensors, particularly ones with different modalities that might not have problems in the same circumstance, it sure makes it a lot easier to reliably get to a good answer in real-time. Sure, in unique circumstances, you could have increased confusion, but you're far more likely to have increased clarity.
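As a toy illustration of the "increased clarity" case: a minimal inverse-variance fusion of two distance estimates, with hypothetical sensor uncertainties. This is a sketch, not how any production stack actually weighs its sensors.

    # Minimal inverse-variance fusion of two independent estimates of one quantity.
    # Numbers are hypothetical, purely for illustration.
    def fuse(est_a: float, var_a: float, est_b: float, var_b: float):
        """Combine two noisy estimates; weight each by its inverse variance.

        The fused variance is never larger than the smaller input variance,
        which is the "increased clarity" case.  The pathological cases show up
        when the errors are correlated or one sensor is silently wrong.
        """
        w_a, w_b = 1.0 / var_a, 1.0 / var_b
        fused = (w_a * est_a + w_b * est_b) / (w_a + w_b)
        fused_var = 1.0 / (w_a + w_b)
        return fused, fused_var

    # Camera is unsure in glare (large variance), radar is confident:
    dist_m, var = fuse(est_a=38.0, var_a=25.0, est_b=31.0, var_b=1.0)
    print(f"fused distance ~{dist_m:.1f} m, variance {var:.2f}")  # ~31.3 m, 0.96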
This is one side of the argument. The other side of the argument is that what matters more than the raw sensor data is constructing an accurate representation of the actual 3D environment. So an argument could be made (which is what this guy and Tesla are gambling on and have designed the company around) that the construction & training of the neural net outweighs the importance of the actual sensor inputs. In the sense that even with only two eyes (for example) this is enough when combined with the ability of the brain to infer the actual position and significance of real objects for successful navigation. So as a company with limited R&D & processing bandwidth, you might want to devote more resources to machine learning rather than sensor processing. I personally don't know what the answer is, just saying there is this view.
The whole point of the sensor data is to construct an accurate representation of the actual environment, so yes, if you can do that, you don't need any sensors at all. ;-)
Yes, in machine learning, pruning down to higher signal data is important, but good models are absolutely amazing at extracting meaningful information from noisy and diffuse data; it's highly unusual to find that you want to dismiss a whole domain of sensor data. In the cases where one might do that, it tends to be only AFTER achieving a successful model that you can be confident that is the right choice.
Tesla's goal is self-driving that consumers can afford, and I think in that sense they may well be making the right trade-offs, because a full sensor package would substantially add to the costs of a car. Even if you get it working, most people wouldn't be able to afford it, which means they're no closer to their goal.
However, I think for the rest of the world, the priority is something that is deemed "safe enough", and in that sense, it seems very unlikely (more specifically, we're lacking the telltale evidence you'd want) that we're at all close to the point where you wouldn't be safer with a better sensor package. That means they're effectively sacrificing lives (both in terms of risk and time) in order to cut costs. Generally when companies do that, it ends in lawsuits.
> You have to eventually decide to trust one or the other, in real-time.
More or less. You can take that decision on other grounds - e.g. "what would be safest to do if one of them is wrong and I don't know which one?"
The system is not making a choice between two sensors, but determining a way to act given unreliable/contradictory information. If both sensors allow for going to the emergency lane and stopping, maybe that's the best thing to do.
It's far from the worst way, because if humans are visually blinded by the sun or snow or rain they will generally slow down and expect the cars around them to do the same.
Predictability especially around failure cases is a very important feature. Most human drivers have no idea about the failure modes of lidar/radar.
A car typically doesn't have lights shining in all directions. My Tesla doesn't at any rate. At night, backing into my driveway, I can barely see anything on the back-up camera unless the brake lights come on. If it's raining heavily it's much worse. But the ultrasonic sensors are really good at detecting obstacles pretty much all around.
Interesting. I find the rear camera in my Tesla is outright amazing in the dark. I can see objects so much more clearly with it than with the rear view mirror. It feels like I'm cheating... almost driving in the day.
Reverse lights are literally mandated by law. Your Tesla has them, and if they're not bright enough that's a fairly cheap and easy problem to fix relative to the alternatives.
The sensors also detect obstacles on the side of the car where there's no lighting. Every problem has some sort of solution, but removing the ultrasonic sensors on the Tesla is going to result in poorer obstacle detection performance. Sure, if they add 360 lighting and more cameras they can make up for that.
EDIT: Also I'm not quite positive why the image is so dark when I reverse at night. But it still is. The slope and surface of the driveway might have something to do with that... Still I wouldn't trust that camera. The ultrasonic sensors otoh seem to do a pretty good job. That's just my experience.
EDIT2: I love the Tesla btw. The ultrasonic sensors seem to work pretty reliably and they're pretty much their own system, so the argument about complexity doesn't really seem to hold water, and on the face of it the cameras won't easily replace them...
You are greatly overestimating the functionality of the sensors, and underestimating the importance of the rest of the system. Sensors are important, but the majority of the work, effort and expense is involved with post-sensor processing. You can't just bolt a 'Lidar' on to the car and improve quality of results. Andrej and other engineers working on these problems are telling everyone the same story. The perfect solution is not obvious to anyone, and they have chosen one path. Engineers aren't trying to scam people out of a few dollars so they can weasel out of making high quality technology. This has Nothing to do with cost-cutting.
"The perfect solution is not obvious to anyone, and they have chosen one path. Engineers aren't trying to scam people out of a few dollars so they can weasel out of making high quality technology. This has Nothing to do with cost-cutting."
Lidar vs. Stereo camera vs. multiple cameras vs. ultrasound is a separate problem that engineers are trying to solve, not how can we sell cheaper mops. The decision to not use Lidar, as he says, and is the common debate being explored by people working on autonomous driving is whether it makes more sense to focus on stereo image sensors with highly integrated machine learning, or maybe use Lidar or other sensors and include data Fusion processing. Both methods have trade-offs.
"Lidar vs. Stereo camera vs. multiple cameras vs. ultrasound is a separate problem that engineers are trying to solve, not how can we sell cheaper fucking mops."
Okay? Tesla is a car company and they are absolutely trying to sell a cheaper car. That's obvious to anyone that's been in one.
"Both methods have trade-offs."
Right, isn't that why most other systems use both?
Both methods have trade-offs as in there are positive and negative merits for both approaches. Using both systems requires the sensor data to be fused together to make real-time decisions. This is the whole point, why people are trivializing this problem, and why it is easy to believe that they are just trying to scam people by going cheap on using multiple sensors. If you want to argue that it is better to use Lidar then explain why apart from 'others do it'.

The podcast, and previous explanations by this guy and others that agree with him (which occurred way before some shortage issues), is about what is the best way to solve autonomous driving. You don't solve it by simply adding more sensors. There are multiple hours of technical information about why this guy Andrej thinks this way is best. Others make arguments for why multiple sensors and fusion makes more sense. No one knows the correct answer, it will be played out in the future.

Maybe what some people care about is cheaper cars. That is not what the podcast was about, that is not how the Lidar + stereo camera vs. stereo-camera only decision was made. And in terms of the advancement of human civilization it is not interesting to me whether Tesla has good or bad quarterly results compared to what is the best way to solve the engineering problems & the advancement of AI, etc.

I don't really care very much but it is slightly offensive when many people just dismiss engineers who are putting in tons of effort to legitimately solve complicated problems as if they are just scam artists trying to lie to make quick money. That is also a stupid argument. No company is going to invest billions(?) of dollars and tons of engineering hours into an idea they secretly know is inferior and will eventually lose out because they can have a good quarter. That is not a serious argument.
I am an engineer working on autonomous vehicles. Nothing personal, just responding to the thread as a whole. I don't believe this guy is conspiring to trick anyone. Business decisions, of course. I think they are in good faith gambling on this one approach. So I am interested to see if their idea will win, or if someone else figures out a better way.
The problem is not that he was wrong; the problem is that he's made a motherhood statement in response to a very specific question.
He's not conspiring to trick people per se but he's also not being super clear. His position obviously makes it difficult to answer this question. It's possible he really believes this is better but if he didn't he wouldn't exactly tell us something that makes him and his previous employer look bad. Also his belief here may or may not be correct.
Is it a coincidence that the technical stance changed at the same time when part shortages meant that cars could not be built and shipped because of shortages of radars?
More likely there was some brainstorming as a result of the shortages and the decision was made at that point to pursue the idea of removing the additional sensors and shipping vehicles without them. This external constraint, combined with some (anecdotal) reports of increases in ghost braking, makes the claim that this is actually all-around better a little difficult to believe. It's not clear there was enough data at the time to prove it, and even Andrej himself sort of acknowledges that it's worse by some small delta (but has other advantages; well, shipping cars comes to mind).
So yes, sensors have to be fused, it's complicated, it's not clear what the best combination of sensors is, the software might be larger with more moving parts, the ML model might not fit, a larger team is hard to manage, entropy - whatever. Still seems suspicious. Not sure what Tesla can do at this point to erase that; they can say whatever they want, and we have no way of validating it.
Maybe you're right, I don't care about Tesla drama.
Here is one possible perspective from an engineering standpoint:
Same amount of $$, same amount of software complexity, same size of engineering teams, same amount of engineering hours, same amount of moving parts. One company focuses on multiple different sensors and complex fusion with some reliance on AI. Another company focuses on limited sensors and more reliance on AI. Which is better? I don't think the answer is clear.
The other point is that I am arguing that many people are over-stating the importance of the sensors. They are important, but far more important is the post-processing. Any raw sensor data is a poor representation of the real environment. It is not about the sensors, but about everything else. The brain or the post-sensor processing is responsible for reconstructing an approximation of the environment. We have to infer from previously learned experiences of the 3D world to successfully navigate. There is no 3D information coming in from sensors, no objects, no motion, no corners, no shadows, no faces, etc. That is all constructed later. So whoever does a better job at the post-processing will probably outperform regardless of the choice of sensors.
People absolutely get that. Their issue is that Tesla is relying only on visual data and then, on a disingenuous basis, insisting that this is okay because humans "only need eyes" or some similar sort of strawman argument.
Okay so they are "good faith" gambling? I don't want to drive in a car that has any gambling... I don't get how it being in good faith (generous on your part) makes it less of a gamble?
Uhh, highest accuracy and bandwidth for what? You can have a camera that can see a piece of steak at 100K resolution at 1000 FPS, but that doesn’t mean you can use a camera to replace a thermometer. Blows my mind how people eat up that cameras can replace every sensor in existence without even entertaining basic physics. ML is not omnipotent.
For the specific task of (for example) cooking a steak it’s not hard to envision a computer vision algorithm coupled with a model with some basic knowledge of the system (ambient temperature, oven/stove temperature, time cooking, etc.) doing an excellent job.
No, I can't envision this. Surface texture alone will not tell you if meat is cooked. There is no getting around the temperature probe.
Now, simple color matching models are used in some fancy toasters on white bread to determine brownness. That's the most I've ever seen in appliances...
I don't think it was your intent, but your statement makes it seem like all Tesla engineers are looking at Twitter code. I bet this number is closer to 4.
Tesla has ca. 1000 software engineers working in various capacities. The ca. 300 that work on car firmware and autonomous driving are probably not participating in the Twitter drama.
I don't think the goal is to review all Twitter source. That should be the job of the (new?) development team. I think the goal was to look at the last 6 months of code, especially the last few weeks, for anything devious.
> "Team focus on vision which is by far the highest accuracy and bandwidth sensor allows for a faster rate of safety innovation given a constant team size."
By hiding the ball that you are starting from a much more unsafe position
> vision which is by far the highest accuracy and bandwidth
They are literally the least accurate of all sensors.
Radar tells you distance and velocity of each object. Lidar tells you size and distance of each object. Ultrasonic tells you distance. Cameras? They tell you nothing!
Everything has to be inferred. Have you tried image recognition algorithms? I can recognise a dog from 6 pixels; image recognition needs hundreds, and has colossal failures.
We have no grip on the results AI will produce and no grasp on its spectacular failures.
> In other words, the results are better, but not enough to make up for the fact that Tesla can't support additional sensors without incurring a prohibitive amount of additional risk to Tesla. Risk to passengers doesn't appear to be a consideration.
You may be right about the actual decision process Tesla went through, but Karpathy is right in principle. One of the first things he says is "there can be problems with [the sensors]", and a lot of what he mentions increases the risk of run-time failure, not just cost.
It's easy to cast this as an optimization problem where you're trading off asymptotically improved sensing for linearly or superlinearly increased failure rates. There's certainly a point where the complexity of more sensors or certain types of sensors outweighs any marginal benefit they provide.
Taking his point to the extreme, why use 8 cameras? Just use 4? 1? One photodiode?
Cameras can also fail at run-time; there can be (and is) variability in how they're mounted, in the lenses, in the sensors. They can get blinded or not get enough light. Their cabling can fail; random components can fail.
Tesla has claimed that vision outperforms vision+radar but anecdotal reports don't seem to support that conclusion. IMHO these technologies are not directly replaceable, but are complementary. It's like you can't replace your ears with your eyes (yeah, you can read lips, if they're visible).
But sure, there is a sweet spot. Is Tesla really optimizing for best performance at any cost or are they optimizing making more money and selling that to us as an improvement? That's really the question and I don't think we got a frank answer there.
I would also add that Tesla's sensor systems, while perhaps higher quality, are not exactly new ideas. In one form or other laser/radar-based systems have been in cars going back to the 90s for early collision avoidance, automatic cruise control, etc.[1] Longer in other applications.
At least one study seems to suggest those sensors when deployed in automatic emergency braking systems do have a measurable impact on collisions.[2]
Let's say the failure rate on the sensors was 1 in 100 (I'd be shocked if that many were defective). That means 99 other Teslas are using multi-sensor systems and not driving with degraded capabilities. It's an asinine claim that doesn't pass basic logic tests. The only way they weren't a substantial improvement is if Tesla's measurements were conducted in only the absolute most ideal conditions for cameras and no other scenarios.
> Is Tesla really optimizing for best performance at any cost or are they optimizing making more money and selling that to us as an improvement?
More likely they had a fixed budget and optimized with that constraint, if they made a rigorous decision at all. But this is guesswork.
I'm not speculating about how Tesla made the decision, just commenting on Karpathy's answer. His answer is correct even if it isn't true, i.e. even if it isn't what Tesla actually did.
There are plenty of well-known analogs, like the mythical man-month. We all know that throwing more x at a problem is routinely counterproductive, even without cost as a constraint.
It's like the joke about the mathematician in the hot air balloon ... His answer is correct but it's not useful. It is correct there is some optimal solution short of an infinite number of sensors/technologies and larger than no sensors. The argument that Tesla is converging on the optimal solution vs. the more or less known reality that they couldn't get the components they needed to build enough cars is weaselly. But hey, necessity is the mother of invention. Also he can't actually share anything from his work in Tesla because presumably he's under NDA but he's gotta say something.
> It seemed like all the "full cost" negatives Andrej mentioned were related to Tesla's ability to execute, and not what would actually produce better results.
This is objectively wrong, and it's the only substantive part of the discussion. The rest is fantasizing about things nobody actually knows ("It's media training!") and imputing questionable motives to someone who hasn't done anything to deserve that ("He only cares about Tesla's bottom line!").
You could take any one single point in a complex multifaceted argument to the extreme and basically strawman it to death. But that’s not helpful.
I believe his point was to provide a new perspective on the problem, not to reduce the problem to a single reason. I highly doubt the only reason Tesla chose to use vision only in the short term was motivated by a single datapoint.
Even if it was the most important point... in this one person (on a large team’s) mind... it doesn’t necessarily mean it was the most important in the sum of the complex process it took to get to the decision.
So I don’t really see the value in taking it to the logical maximum, because it’s not only illogical that they would be evaluating this one idea in isolation, but even on its own they would still be balancing the optimal performance they got from x vs the optimized value they got from y, then comparing that to the team’s ability to work with both x+y(+z) at the same time.
For ex: You’d probably need 8 cameras pointing different directions vs one highly capable rapidly spinning LiDAR to even compete with it, so why even ask? These problems a) always have context and b) can't be so easily simplified and broken down.
Although you might make a good point that Tesla used this same poor logical-maximum reasoning to determine why not get rid of ALL sensors besides vision.
At least 6 are needed to get a 360 degree view around the car, which obviously is necessary. Think of the 8 cameras as a single better sensor. It's then a question of having one very good sensor or many to fuse.
There are other ways of optimizing for reliability, though, like redundancy in parallel or higher spec’d sensors. But that still gets back to the same issue where they are going to be concerned about cost.
We really should be focusing on what is the best solution and trying to solve price issues through existing techniques e.g. economies of scale, competition, miniaturisation. Instead they are trying to build whatever solution they can that fits in a pre-defined cost window.
Except this isn't a new phone or a pair of sneakers we're trying to take to market; it's something that will directly impact people's lives.
You can get to this conclusion if you're sure Andrej is lying, and that the risks cited are smoke; but only then. BTW, I've upgraded my sneakers after a couple of falls on a rough beach with tangled driftwood (drift trees, really) proved their cheap, too-slick surface had real-world consequences. I was lucky not to break a bone. I'm going to bet he isn't lying, but I can understand someone making the opposite bet, market competition being market competition.
True, he could be wildly misled but he's been around doing this for a while, so that seems unlikely. He could be truly delusional in either case but it's still kinda necessary to knock down his arguments or logic; and that's the critique or analysis I'm not seeing. Just assertions again and again that the real reason is economics triumphing over safety. I'm beginning to think that the idea that there are genuine trade-offs in life is just ungraspable or offensive to many.
There's a lack of evidence either way, which really should tell you all you need to know. I don't think they're delusional, but they are constrained by their context.
Yes, with ML models, you can often be better off trimming down your sensor data. Usually, though, you don't remove entire categories of sensor data. Even when you do, to be confident in such a move you need to first achieve a working model with whatever data you have, and then through refinement you can show that whole categories of data are more hindrance than help (a toy version of that kind of ablation is sketched below). They haven't done that.
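A toy version of that kind of ablation check, for illustration only: synthetic stand-in features, an off-the-shelf model, and a with/without comparison. None of this reflects Tesla's actual stack.

    # Hypothetical sensor-ablation sketch: train the same model with and
    # without one sensor's features and compare cross-validated error.
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 2000
    camera_feats = rng.normal(size=(n, 8))   # stand-in for vision-derived features
    radar_feats = rng.normal(size=(n, 2))    # stand-in for radar-derived features
    target = 2.0 * camera_feats[:, 0] + 0.1 * radar_feats[:, 0] + rng.normal(scale=0.5, size=n)

    def mean_cv_error(features):
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        scores = cross_val_score(model, features, target, cv=5,
                                 scoring="neg_mean_absolute_error")
        return -scores.mean()

    with_radar = mean_cv_error(np.hstack([camera_feats, radar_feats]))
    vision_only = mean_cv_error(camera_feats)
    print(f"MAE with radar: {with_radar:.3f}, vision only: {vision_only:.3f}")
    # A small (or negative) gap is the "delta was not massive" evidence you'd
    # want to see before dropping a whole sensor category.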
It seems quite clear that the reality is a full sensor array is just economically non-viable with their business model, and that's framing their whole thought process.
Because they promised FSD back in 2017, they can't acknowledge that going without those sensors means it's going to take them much longer to achieve FSD. Because of safety/regulatory oversight, they also can't acknowledge that going without those sensors means there will be additional safety risk.
So they're stuck making these rationalizations that everyone in the industry knows are at best half-truths. No doubt, at some point we'll figure out how to do self driving with a much more limited sensor package, and when we do, we'll achieve a significant improvement in the cost effectiveness of self-driving.
In the meantime, there's a lot of "rationalization" going on.
Everything in engineering has a cost tradeoff, and always has. And people's lives are improved by things they can afford. There is no "best solution" you can talk seriously about without talking about cost.
Why not have a thousand sensors if more is better?
Whose money should “we” be spending on this grail quest?
This mindset is something I see a lot, that “best” means the technically optimal (or sometimes just personally most convenient) solution to the specific problem that they personally are working on. If they take a step back and look at the bigger picture, the technical merits are usually only a tiny part of the whole decision.
There's an opportunity cost to trying to get to the best solution. What about all of the people who die in the meantime while we delay rolling out something because it isn't perfect? Just doing some googling, the Waymo stack is estimated to cost somewhere in the range of $50-100k, not including the car. A better solution that no one can afford is no solution at all.
Ultimately the only requirement is that the system is safer than humans by some margin that makes people comfortable buying such a system. If that amount is even as little as 2x safer than humans, we still have a moral obligation to roll that out, even if we could be 5x safer with another $50k worth of sensors and processors on the car.
These kinds of moral arguments are silly because they hit a brick wall as soon as they encounter how society operates in practice. If Tesla has a moral obligation to roll out an FSD that's just a little safer than humans, then do they not also have an obligation to make it available to all their competitors? If not, does every individual have a moral obligation to buy only Tesla cars? Do governments now have an obligation to subsidize Tesla cars so anyone can afford them? Etc.
And all this only considers first-order effects. If a 2x safer FSD feels more dangerous than normal driving and thus reduces FSD uptake for a decade, doesn't Tesla have a moral obligation not to release it, to preserve the perception of safety of self-driving technology?
Wonder if this is a strong argument for public transit too. While self-driving cars are developed, let's make every effort to figure out if we can get people to take the existing self-driving trains and subways…
I feel the other reason is that Tesla has not figured out a way to put radar into their ML pipeline. If you take the range-Doppler map from the radar as the 'pixel' map, that data is inherently very dependent on the scenario and on the radar sensor's intrinsic parameters. This variability in what the radar sees in range-Doppler space is what makes it a challenge for ML/AI pipelines.
If Tesla were to 'fuse' information from these sensors at the object-track level, I believe they would be less susceptible to this variability.
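A minimal sketch of what that could look like, assuming each sensor hands over a per-object range estimate with a variance; the names and numbers are illustrative, not anything from Tesla's pipeline.

    # Hypothetical object-track-level fusion: combine two independent range
    # estimates by inverse-variance weighting (the core of a Kalman update).
    from dataclasses import dataclass

    @dataclass
    class Track:
        range_m: float    # estimated distance to the object
        range_var: float  # variance of that estimate

    def fuse_tracks(cam: Track, radar: Track) -> Track:
        w_cam = 1.0 / cam.range_var
        w_rad = 1.0 / radar.range_var
        fused_range = (w_cam * cam.range_m + w_rad * radar.range_m) / (w_cam + w_rad)
        return Track(fused_range, 1.0 / (w_cam + w_rad))

    # Camera depth is noisier at long range; radar range is tight, and its
    # scenario-dependence stays upstream, hidden behind the track interface.
    print(fuse_tracks(Track(range_m=48.0, range_var=9.0),
                      Track(range_m=45.5, range_var=0.25)))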
Exactly. Radar gives you direct range data; camera pixels need to be processed by ML to infer range data, and the latter is never going to be as close to ground truth as the former, so the former should be prioritized.
Not quite. Light waves are so short that you'll get some return from almost any surface, because the surface is rough at the scale of such a small wavelength. This isn't true of radar, and it's not just what substance the outbound radar hits but how flat it is, too. You may get no return, or almost none. Even smooth, round steel posts give very little return, IIRC. There's also an echo problem with long waves such as sound and radar, particularly in urban areas, in which case what you think is a firm direct return may be a very indirect return that happens to be in sync with the signal you were expecting.
It's interesting, that kind of object-level fusion is a fairly different problem from training visual perception, following some of the less-in-fashion robotics techniques. I wonder if it's a case of the Tesla engineers focusing on the fad technologies (or just their strengths) more than it's a hardware cost thing.
That's one way to read it, but in my own experience the "do one thing really well" approach can yield far better results. Meaning, if vision is truly sufficient and you do it really well, rather than doing a bunch of sensors "okay", that may actually be safer overall. You might get far more focused and practical results from your efforts.
I’m not saying this is definitely true, and at the moment we probably can’t verify it either. I’m just “steel manning” his case, as Lex loves to say.
I think you’re probably correct that the business aspect was a significant factor, but perhaps it wasn’t everything.
Devil's advocate: if the cost of working to improve cameras to the point where they eliminate that delta is lower than the cost of using the sensors instead, then it's a net benefit.
Yes, the current delta was not massive and will shrink over time.
By getting rid of the extra sensors they eliminate a temporary crutch and focus resources on the simple solution.
Not a new concept, by the way. Henry Ford was obsessed with simplifying and eliminating every part that wasn't necessary on the Model T, for virtually all the same reasons.
The difference is that Ford started with something that worked. The Model T is noteworthy because of the way it was made, not for its abilities as an automobile.
Tesla is starting with something that doesn't work. No one has been able to achieve full autonomy yet, not even Waymo on its own turf, despite Waymo being well ahead of Tesla. I trust Tesla will be able to close the gap and perform to its current standards without radar and ultrasound, which would be fine if the current standards weren't terrible in the first place. What I mean is that Tesla is currently at the awkward spot where it is good enough for cruise control, but not good enough to safely take a nap in the driver's seat.
As for the "simple solution", you may know the saying "For every complex problem there is an answer that is clear, simple, and wrong". I think it applies here.
What crutch? What simplification? These sensors are widely deployed and have already been perfected. Systems which use only one modality are the crutch. Sensor fused systems will always be safer, and are the future.
This move is purely about screwing passenger safety for cost and sales.
> Regarding results, the best Andrej can do is, "In this case, we looked at using it and not using it, and the delta was not massive." In other words, the results are better, but not enough to make up for the fact that Telsa can't support additional sensors without incurring a prohibitive amount of additional risk to Tesla. Risk to passengers doesn't appear to be a consideration.
I think this mischaracterizes Andrej's response. If anything, he is referring to a holistic view of the vehicle, which includes but doesn't entirely consist of Tesla. For example, 5-10 years down the road, when sensors start going bad, consumers will appreciate having fewer things that can go wrong with the vehicle--that is one of the advantages of electric over ICE, after all.
If anything this is an acknowledgement that George Hotz was right in focusing on optical sensors with Comma.ai.
1) He's not touching on the software cost of integrating different sensor data into the same trained machine learning model; it is likely far simpler to just stick to stereoscopic vision data (the same thing the human genome decided!)
2) That said, it seems at least theoretically advantageous to have a sensory system that exceeds what humans are limited to; things like LIDAR can work in complete darkness and potentially spot, for example, pedestrians crossing a dark road without any reflective clothing on, where a vision-based system would fail (perhaps add infrared sensing?)
Anyway, doesn't AEB (automatic emergency braking) have to be installed in every car, by law, in the US, around now? And wouldn't that be less reliable if done via vision?
>it is likely far simpler to just stick to stereoscopic vision data (the same thing the human genome decided!)
There’s a lot more to perception while driving than just stereoscopic vision.
First, your stereoscopic “cameras” (eyes) are mounted in free-rotating sockets, which are themselves mounted in a rotating and swiveling base (your head/neck). Your eyes can do rapid single-point autofocus better than any existing camera. They also have built-in glare mitigations: squinting, sunglasses, and sun visors. This system is way more advanced than fixed cameras. Yes, even an array of fixed cameras with a 360-degree field of view.
Then you have your sense of touch, your hearing, and your sense of equilibrium. You feel motion in the car. You feel vibrations in the pedals. You hear road noise, other cars, sirens, and the engine (not much in EVs). You smell weird smells and know when you’re driving with your e-brake on or when there’s a skunk nearby. There’s a lot getting fused with the vision to make it all happen, and I think you’d be surprised how “broken” your driving capabilities would be if you took one of these “background” senses out of the equation.
My anecdote: I drive a manual transmission car. A few months back, I woke up with no hearing in my right ear. Spooked, I drove to urgent care. I could not drive well at all; I was holding low gears for way too long. I learned that I use hearing almost exclusively to know when to shift. If you had asked me beforehand, I probably would have said that I’m visually monitoring the tachometer to know when to shift. Not the case. Also, I had a TERRIBLE sense of my surroundings. As I drive, I’m definitely building a model of the environment around me based on road noise, sound from other cars, sirens, and the like. Without hearing in just one ear, I felt very disconnected and unsafe. Living in California, where lane splitting is legal, I had several motorcycles catch me completely off guard. I had my hearing restored at urgent care and everything went back to normal immediately on the drive home.
I think Andrej and Tesla massively overestimate vision’s sole ability to solve the problem. Humans are fusing lots of sensation to drive well.
> likely far simpler to just stick to stereoscopic vision data (the same thing the human genome decided!)
Yeah, and until we had reliable and powerful artificial lighting, it was highly unsafe to travel in low visibility or darkness. We used to end journeys when darkness fell.
Animals that do require precise movement in low-visibility conditions (bats, dolphins) often evolved ultrasound solutions.
So should we license Tesla vehicles to only operate when visibility and weather forecast is good and not drive in the dark at all?
Excellent points, I didn't think about the fact that even evolution couldn't come up with a vision system that works as well in the dark as it works in the daytime.
Well, actually, the human visual system at night, while not as good as a cat's and perhaps a dog's, is still much better than any camera we've come up with thus far. I read something that claimed we can actually detect individual photons hitting our retina once we are adapted to the darkness.
I think the key point he's trying to make is that the size of the fleet is more important than the quality of the sensor. The risk would be reduced by a better system, and he seems to be convinced that rolling out vision to more and cheaper cars would get you there.
There is a great argument for having ultrasonic sensors and radar in a recent video by Ryan from FortNine discussing two fatal accidents involving Tesla Autopilot: https://www.youtube.com/watch?v=yRdzIs4FJJg
lex's comment did not strike me as a dig. i am actually concerned by your comment because it makes me wonder if i am missing other things too? it just doesn't seem like a dig. it seems like he thought of something funny and wanted to share it. am i alone in this?
and also i don't understand your assertion that it was some kind of cynical maneuver to re-frame the question. he could have also said "yes, more sensors are always better but you can add an arbitrary number of sensors and so we had to decide where to draw the line. the cameras we use are capable of meeting our goal of full self driving that is significantly safer than a human driver. and this also streamlines the production and software which has a material impact on our ability to actually produce the cars which is of course necessary to meet the goal of making self driving cars. bloat could actually kill tesla."
this is logically the same thing that he said in the interview, so what's cynical about it? how is it underhanded?
also is there some intrinsic limitation of the dynamic range of cameras? people are talking about problems with dynamic range being intrinsic to cameras but i'm pretty sure that cameras and especially camera suites that do not have more problems with dynamic range than a human eye are possible to make and probably already on the market.
> also is there some intrinsic limitation of the dynamic range of cameras? people are talking about problems with dynamic range being intrinsic to cameras but i'm pretty sure that cameras and especially camera suites that do not have more problems with dynamic range than a human eye are possible to make and probably already on the market.
I think it's possible that professional movie cameras (with the appropriate lenses) may have higher dynamic range than human vision. Good luck getting those cheaper than a lidar.
just take some cameras, set each one at a different exposure via something like smoked glass and composite the images in real time. i don't know, it just seems pretty easy.
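Roughly, the compositing step could look like the sketch below: synthetic frames stand in for real camera feeds, and the well-exposedness weighting is just one naive choice.

    # Naive exposure fusion across bracketed frames: weight each pixel by how
    # far it is from being crushed (0.0) or blown out (1.0), then average.
    import numpy as np

    def fuse_exposures(frames):
        stack = np.stack([f.astype(np.float32) / 255.0 for f in frames])
        weights = np.clip(1.0 - 2.0 * np.abs(stack - 0.5), 1e-3, None)
        fused = (weights * stack).sum(axis=0) / weights.sum(axis=0)
        return (fused * 255).astype(np.uint8)

    # three simulated exposures of the same scene: under, normal, over
    scene = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
    frames = [np.clip(scene * gain, 0, 255).astype(np.uint8) for gain in (0.4, 1.0, 2.5)]
    print(fuse_exposures(frames).shape)  # (480, 640)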
Well, that increases the expenses from the sensors several fold (at least 2x of course, but I would guess closer to 3x-4x, depending on how many you need to cover the whole range).
It wasn't a dig. It was calling out a bullshit move that, in my opinion, Andrej deployed out of panic more than strategically. (My evidence for this being that Andrej eventually gave a good answer.)
I don't think it was calling out a bullshit move (and definitely not a dig)
a) Saying "I wonder when language models will do this" is a total Lex thing to say. That's what he's into.
b) Lex is almost always a softball interviewer, though one with interestingly deep knowledge. He interviews experts, and he errs on the side of respect. If you're looking for hardball, don't listen to Lex, it's not what he does. He almost never calls out bullshit, and he especially doesn't call out evasion.
Now, it's still possible that it was a panic-deployed evasion on Andrej's part, and whether or not that was his mentality, I agree that he gained nothing by it, and did not in fact reframe the question at all.
i just don't get it. lex says "let's re-frame the question: can a language model drive a car?" this doesn't have any insinuations about andrej's intentions or motives. if it were calling out bullshit it would be "let's re-frame the question: why is elon musk never wrong?"
lex kisses elon musk's ass, last time i checked he's on musk's side in the lidar debate, and also lex has a record of listening patiently no matter what and i have never known him to check people or "call out their bullshit." lastly, what andrej did wasn't a bullshit move. re-framing a question has never been known as something people do only when they are being deceptive, and it is very common in intellectually honest answers/explanations in my experience.
i'm still shocked because usually i see what other people see. but people calling this a dig/bullshit came out of left field for me. i hope i don't miss stuff like that more than i realize...
edit: i read your comment again and i didn't read it right. he panics, gives himself some more room, lex acknowledges this but in a prickly way. it makes sense but i'm still upset because when i watch it all i see is lex making a joke. i guess i have severe autism.
See my sibling comment, but I agree with you, especially about Lex's track record, and I think the other commenters are projecting.
And you're right, reframing a question is like THE MOVE for academics, so it's completely not evidence that Andrej was making "a bullshit move" (although it's also a classic move for PR flaks, so it's possible he was making a bullshit move).
Andrej did get around to answering the original question, he just wanted to say more, to put it into a bigger frame with more context. I had the same "weaseling" concern at first; but his answer was more or less "You lose more than you gain, but yes there was a small delta; in exchange for which any organization would take on not just an economic hit but a lot of additional opportunities for process and maintenance errors; plus distracting the team." So he'll agree that in an ideal world you'd want 'em, just not want 'em that much; but in the real world, more geegaws that aren't really pulling their weight are a terrible idea.
Although he didn't explicitly say so, neither his answer nor Elon's "take it out 'cause you can always put it back in if it turns out you really need it" philosophy absolutely rule out lidar coming back in the future if some remaining edge case just requires it. Clearly he thinks this is quite unlikely, however.
You are making a lot of potentially faulty assumptions. 1) That the "delta" was wide enough to save/harm people; you have no idea. 2) That the extra information provided would always be valuable and/or not be overcome by better AI models using the visual sensors in the future. 3) That the amount of technical overhead generated by the extra sensors was not prohibitive long term. When working with AI, there are often times when it would seem logical that extra relevant data will always improve a model, but that turns out not always to be the case, or it provides so little value that managing another dataset is just not worth it.
> I thought it was telling that Andrej immediately "reframed" the question because Lex asked the "wrong question". This is a classic evasion technique
I agree with this assessment. However:
> Telsa can't support additional sensors without incurring a prohibitive amount of additional risk to Tesla. Risk to passengers doesn't appear to be a consideration.
This is a stupidifying take. Of course, when you work in a line of business producing gadgets that, as an unintended side effect, kill a lot of people (napkin math suggests above 2 milli-kills per car in the US), you will need to pick a point at which you say further fatality reduction is no longer justified given the economic cost of achieving it. Even if you are a pure altruist (if you go out of business, less safe cars will replace yours). Conversely, even if you are the embodiment of capitalist evil, risks to passengers will absolutely affect your bottom line, and if you are rational you will take them into consideration. Any meaningful criticism needs to be about the trade-offs they make, not that they make them or are loath to explicitly say so on camera.
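For what it's worth, one way that napkin math could go; the figures below are rough assumptions of mine, not anything from the interview or the parent comment.

    # Back-of-the-envelope: fatalities attributable to an average US car over
    # its lifetime. All inputs are approximate, assumed values.
    deaths_per_year = 40_000           # approx. US road fatalities per year
    registered_vehicles = 280_000_000  # approx. registered US vehicles
    vehicle_lifetime_years = 15        # approx. average service life
    per_car = deaths_per_year / registered_vehicles * vehicle_lifetime_years
    print(f"{per_car * 1000:.1f} milli-deaths per car over its lifetime")  # ~2.1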
> …you will need to pick a point at which you say further fatality reduction is no longer justified given the economic cost of achieving it.
You're right — the sad truth is that corporations put costs on human lives every day. Where I think we disagree is that you believe they made the decision based primarily on costs. After watching this video, I believe they made the decision because they didn't think they could reliably implement and support a sensor fusion approach.
(BTW, I enjoyed "stupidifying"! I'm sorry I made people stupider.)
Tesla hasn't proven itself to be a capable major car manufacturer (they probably lead the minor category, at least in deliveries) in any respect but one: their de-prioritizing of human life.
Is there hard data on how deadly they are vs. other auto manufacturers? There is definitely a narrative that the cars are dangerous, but I'd like to see that quantified.
The majority of comments surrounding stereo-camera/lidar questions have a ridiculously simplified idea of the problem. It's obviously the case that 'more sensor good, less sensor bad'. This is Frankenstein's-monster-level technical analysis. Why don't the majority of large-brained animals have many eyes and many different antenna-like appendages processing an array of diverse sensory input? You don't just automatically gain by adding lots of sensors. The signals have to be fused together and reliably cooperate and come to agreement in real time for any decision. Any sensor only provides raw, crude data. The majority of the work is done by processing this crude data and inferring a much more sophisticated approximation of the real environment from prior knowledge, hence using neural nets with pre-trained data. It is a good debate whether that approximation can be done better by adding more sensor input and diverting R&D and processing resources toward fusion, as opposed to improving the results that can be obtained from stereo image sensors. It's not obvious to anyone. And nature seems to inform us that most large-brained animals evolve to rely heavily on two eyes instead of 16 eyes + lasers. This is an interesting discussion, but the issue isn't 'Tesla could just bolt a lidar box to the roof and magic, but they want to scam you out of a few extra bucks'. That is a moronic idea.
> Why don't the majority of large-brained animals have many eyes
Because of the cost of additional eyes. If Tesla is optimizing for cost against safety, that's sort of the point.
I don't believe that's totally the case. Andrej later makes a better argument regarding limited R&D bandwidth, noise, and entropy. But the "I would almost reframe the question" evasion was disconcerting. It's a textbook media-training tactic for avoiding a question to which you have no good answer. That it was deployed here, badly, against a skilled interviewer such that it backfired is a valid observation.
Everyone has limited R&D and processing bandwidth. It's not just Andrej saying this, but anyone working on engineering autonomous vehicles. It has nothing to do with the cost of additional eyes. This is over-simplifying the problem. Our eyes don't work that way. The data coming from eyes and image sensors is very crude and relies on either your brain or very sophisticated post-sensor processing to construct a 3D approximation of the actual environment. The sensors themselves don't provide this information. They don't distinguish distinct objects, corners, shadows vs. changes in color, or all sorts of phenomena that don't actually exist in the sensor data. This has to be inferred later by a brain, or by processing that relies on prior assumptions and 'training' from previous experience with 3D environments.
I don't really care what Tesla is doing vs. what other companies are doing, but these 'cost cutting' arguments don't matter. I would suspect that the R&D invested into the machine learning infrastructure, the custom IC, and the software engineering outweighs whatever amount they could save by removing a small sensor. And I don't believe that this guy Andrej is conspiring to squeeze a few bucks out of his customers at the expense of degrading his life's work. He is not trying to sell mops.
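To make the "inference does the work" point concrete: even with a calibrated stereo pair, depth isn't in the pixels; it has to be computed from matched disparities. The focal length and baseline below are assumed values, purely for illustration.

    # Depth from a calibrated stereo pair: Z = f * B / d, with disparity d in pixels.
    focal_px = 1200.0   # assumed focal length, in pixels
    baseline_m = 0.12   # assumed spacing between the two cameras, in meters
    for disparity_px in (60.0, 12.0, 3.0):
        depth_m = focal_px * baseline_m / disparity_px
        print(f"disparity {disparity_px:5.1f} px -> depth {depth_m:5.1f} m")
    # 60 px -> 2.4 m, 12 px -> 12.0 m, 3 px -> 48.0 m: at long range a one-pixel
    # matching error moves the estimate by many meters, which is why the learned
    # inference side carries most of the weight.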