Imagine a person with low vision walking into a coffee shop, ordering from a wall menu and then sitting down to read a newspaper. Or that same individual producing the right fare amount for a taxi to the airport, and then reading the flight board to find their gate for departure. While these examples might seem blasé to the fully sighted, to those with low vision, they represent a giant leap in independence. The clever little device that makes these scenarios possible is the OrCam, a computerized wearable camera that may very well revolutionize the daily lives of people with low vision. Developed by an Israeli company, OrCam looks similar to Google Glass: the user wears a mini camera attached to their eyeglasses. Yet OrCam is designed to relay information back to the user through an earpiece. All the user needs to do is point to whatever they want to “see” and within a fraction of a second the device will relay the information back to them through a discreet earphone. This allows people with low vision to read text effortlessly and to more easily navigate their way through the world.
Although today’s high-tech market offers a variety of assistive devices, many are limited to specific tasks, such as reading text from flat surfaces only. But OrCam packs more processing punch than anything else currently on the market and it’s highly versatile: it can read any printed text, recognize and remember faces, places and objects, and can even tell you if a traffic light is green or red.
OrCam was created in 2010. Its co-founder and chairman, Amnon Shashua, who is also a professor of computer science at The Hebrew University of Jerusalem, spoke with ABILITY’s Chet Cooper about how the technology evolved, how it works and differs from other technology, and the company’s plans for its future development.
Chet Cooper: How did you get involved in OrCam?
Amnon Shashua: My field of expertise is computer vision, which is making computers, programming computers to see. I’m doing that at the university. I founded a number of companies in computer vision. I was working on building a model of visual interpretation which surpasses anything else that people contemplate, a model in which a camera observes a scene and tries to understand what’s going on in the scene. I was thinking, who can use such a capability, assuming that this capability can be developed? And one of the first things that came to mind was people who are blind. Someone who’s blind will find it useful to have a system that tells him or her what’s out there in the visual field.
But then, looking more closely into this area, we found the number of people who are blind is very small, and their needs are very complex. On the other hand, there are those with low vision, which is one level above blind. They see something but not enough to manage daily life and they also can benefit from the capability of having a computer translate the visual world for them. This is how the idea started.
Cooper: So prior to OrCam, what were you involved with?
Shashua: OrCam is a new company, but I have another company, Mobileye, which developed a system using processor chips and algorithms for interpreting visual information in the context of preventing car accidents. You have a camera mounted on the windscreen facing forward, recognizing people, cars, traffic signs, lane marks, monitoring the visual field and watching for an imminent accident. When an accident is imminent, the car system starts warning the driver and actually makes evasive maneuvers, like automatic braking and so forth.
Cooper: Are there other products that do the same thing? In the States, we’re seeing a lot of car manufacturers marketing that technology now.
Shashua: We are the dominant supplier in that area. We supply systems to about 18 car manufacturers. When you hear about the camera system in a car doing driver assist, 80 percent of it’s ours.
Cooper: That’s really great. Are you involved in computer navigated driving, which might eventually make the need for steering obsolete?
Shashua: Yes, self-driving. We are very active in that area. There was an article in the New York Times by John Markoff a few months ago talking about our activity on autonomous driving.
Cooper: That relates to your OrCam demographic.
Cooper: How does OrCam work?
Shashua: I can explain what OrCam does in non-technical terms. Assume you have low vision and a helper has been assigned to you, someone who sees what you’re looking at, who also has normal vision and sufficient intelligence to understand what kind of information you would like to extract from the scene. The only way you communicate with the helper is by pointing. You point at an area of the scene you want to get information about, the helper figures out what’s out there, figures out what kind of information you would like, and then whispers in your ear. If you have low vision and you’re standing at a bus stop, you know that the bus is approaching, but you don’t know the bus number. So you simply point in the direction, the helper looks at the scene, sees that there is a bus there and tells you the bus number.
Or say you want to cross the street and there’s a traffic light. You know that there’s a traffic light, but you don’t know what color the light is. You point in that direction, the helper looks at the scene, sees the traffic light, understands you want to know what color it is and tells you the color.
Say you’re holding a newspaper. The helper immediately understands that you’re holding a newspaper. You point at an article and the helper figures out which article to start reading and starts reading it for you within a fraction of a second.
Or you hold a familiar object. The helper will tell you what the object is. You can teach the helper new objects. Now, replace the helper with an ocular system.
Cooper: Was this based on Ray Kurzweil’s work in optical character recognition (OCR)?
Shashua: No, no. This is our own work. My field of expertise is computer vision and this is the one thing that I do. The difference between existing OCR, say, the industrial-strength OCR, and what we do in reading is that we have additional challenges. In existing OCR they assume you are scanning or taking a picture of a clean sheet of paper with very legible text. It’s very structured, and those programs will perform fairly well on it. They also assume that the surface is flat, or more or less flat, and they take time to respond. It could take tens of seconds. It’s not something that is done in real time.
Our challenge was several-fold. First, you would like the text to appear on any surface. It could be a crumpled newspaper, one you’re holding in your hand. It could be a product which has texture on it and the text is written on the texture. It could be text in the wild, where you have a natural scene, like, you are looking in the outdoors and there is a street sign. You’re pointing in that direction, and the system has to make sure there is text to begin with and then start reading it. This has to happen within a fraction of a second, because if the function were to take tens of seconds to respond, then it’s not useful to the user. You point in a direction and immediately you get the response.
The existing OCR technologies are based on a scientific core which was developed in the ’80s. We developed the OCR capability from scratch, because what exists today is not suitable for our needs.
Cooper: That’s a little mind-boggling, what you’ve created. And this is all happening in a little system that is on the side of your glasses?
Shashua: The system has two pieces. One is the clip-on camera that you attach to existing eyeglasses. We designed it such that there is a small fixture which you insert on your eyeglasses, and then you can easily remove and insert the clip-on camera. That clip-on camera is connected via cable to a small processing block you place in your pocket. The idea is you communicate with the system not by pressing buttons but by simply pointing to the scene. The system automatically figures out what’s out there and what you need from the scene. You don’t tell the system, “I now want you to read,” or “I would like you to recognize an object for me,” or “I would like you to tell me what is the value of the money note I’m holding.” You only point and the system figures out on its own what is out there and what information you would like.
Cooper: So the system and the computer that’s doing all of this processing is not web-based?
Shashua: It’s all local. It’s all being done within the unit itself.
Cooper: How heavy is the unit?
Shashua: It’s the size and weight of a cell phone.
Cooper: Does it emit any electromagnetic radiation, as there’s concern about cell phones being near the head for too long?
Shashua: No. There is no wireless processing going on.
Cooper: How far along is the beta version? Where are you in your trials?
Shashua: We’ve been developing this for three years in stealth mode. We launched the company’s website in June, and during the launch we announced on the website we would be shipping the first 100 units in September. Within two hours of launching the website those 100 units were sold. Then we announced we would ship another 500 units by the end of the year, and within a week those were sold as well. So now we’re only keeping a backlog. There is a slight delay in the units: we will be shipping them in October, not in September. During this time I’ve been doing extensive testing, not only in preparation for production, but also testing against user groups. We have about 20 people with low vision using the system day in, day out, so we can collect feedback. The response is very, very good.
Cooper: Have you done anything yet with different languages?
Shashua: At the moment it’s only English. We’ll be introducing French, German and Spanish.
Cooper: What is the cost base at this time?
Shashua: While we’re selling this through the internet, the cost at the moment is $2,500. It’s not clear that we’ll continue selling this through the internet, particularly once people establish distribution networks. This price level was designed to be similar to the price of hearing aids. They cost between $1,000 and $4,000, so it’s kind of in a mid-range hearing aid cost.
Cooper: And that cost may go down depending on when you start building volume?
Shashua: Normally when volume goes up, cost goes down. But it could take time.
Cooper: Is anybody looking into insurance picking up the cost?
Shashua: We will, but it’s not the first thing that happens. First a product needs to create its own awareness. There should be a sizeable user base before regulators or insurance companies start taking action. We believe that once we pass the 10,000 unit shipment, the regulators and insurance companies will take action.
Cooper: I know in the States, medical device companies always have a bit of a challenge getting into the system. But it would seem, at least in the beginning, you have an audience that’s willing to pay out of pocket.
Shashua: One of the reasons why the acceptance is so immediate is because unlike people with hearing disabilities, who have technologies that can really change their lives, people with visual disabilities have really no technology to rely on. This is the first time where you have something that tries to give the same value as the hearing aid to someone who has hearing disabilities. It’s the most high-tech device ever introduced to the world of visual loss.
Cooper: I think it’s Star Trek, I didn’t watch it too much, but one of the characters—
Shashua: That’s right, he had kind of a visor over his eyes.
Cooper: Do people compare the technology you’ve invented to that?
Shashua: It’s not the same thing. In Star Trek it is this visor that would feed directly into your brain, bypass the missing retina or the missing eyeballs of the user and feed directly into the visual cortex, and then the visual cortex would process the information and you see. We are talking about something else. We are not correcting your vision loss. We’re compensating for it.
Cooper: Yours is an accommodation. Is this where you focus most of your time?
Shashua: Yeah, it’s one of the things that I do. I’m cofounder and chairman of the company. My partner is Ziv Aviram. He’s the CEO. We now have about 30 employees working on the product. It’s very, very hectic releasing the first 400 units, and then there are big plans. There is a very concrete road map: not only adding more languages, but also improving the hardware, adding more computing power and having a camera with higher resolution, such that one can perform more visual interpretation and provide more value to the user.
Cooper: Can you share what companies you’re working with? For example, for the camera, who are you using for that part of the technology?
Shashua: The camera is a piece of art. It’s a camera by STMicroelectronics. It’s a 5 megapixel video. For example, if you have an iPhone 5 or a Galaxy S3, it has an 8 megapixel camera. But this is 8 megapixel stills. The video resolution is about 2 megapixels. We are talking about a 5 megapixel video, no stills, only video, because what the device is doing is processing video. It’s not taking a snapshot of the world, processing, taking another snapshot. There is video being fed into the processor, and this video is being processed. So it’s a 5 megapixel camera. That’s the camera side.
And in terms of the microprocessor, this is a Freescale state-of-the-art i.MX 6. It’s a quad-core Cortex-A9 with attached DSPs inside. It’s the most powerful off-the-shelf microprocessor that you can buy today.
Cooper: Where are you manufacturing and assembling all this?
Shashua: We are manufacturing in China. At the moment we’re assembling this in Israel, so the first 500 units will be assembled in Israel. The idea is that all the manufacturing and assembly will be done in the plant in China.
Cooper: Do you visit China often? Is someone from your company based over there?
Shashua: Yeah, we’ve got a person in the country monitoring all the production in China.
Cooper: You mentioned that you’ll open up distribution in different locations rather than just being web-based. Do you think this will happen next year?
Shashua: Since we launched our website, we’ve received more than 90 requests from major distributors to cooperate. We are assembling this information, and within the next few months we’ll start building up the distribution. This kind of product requires a hands-on introduction. Most of the users are elderly people, although there are many young people who are visually disabled, and it does require a certain amount of training to get the best value out of the product: how to point, making sure you point at the center of the visual field, things that look trivial to a seeing person but require some training for someone who doesn’t see well. So distributing this device requires some thought. It’s not something that you can simply purchase through the internet and get at home with an instructional video. It requires a bit more care in how you introduce it. Even those first 500 units we’re shipping are going to be shipped together with a hands-on human introduction. This is why we’re limiting to only 500 units, so we can control the first batch of users.
Cooper: Are some of them coming to the States?
Shashua: We’ll have people going to the States who will do the training for the first 500 users. Of course, it’s not something you can do for 10,000 people, and this is why the way we build the distribution requires some thought.
Cooper: I would think you’d be looking at organizations such as nonprofits that have affiliates around the country? You could train the trainers through the national organization and they train the affiliates, that way you’ll have the boots on the ground to be able to—
Shashua: Definitely. You can also think of optometrists, they can be a source for distribution of the device, because they handle people with visual disabilities. Normally if you are visually disabled, you also have eyeglasses, so it comes together.
Cooper: Has OrCam been compared to Google Glass?
Shashua: First, I don’t have a Google Glass. All I know about Google Glass is what I read on the internet. Google Glass is not designed for this kind of use, not in terms of computing power and not in terms of the ability to process video. It looks like with the Google Glass the idea is to have powerful audio processing with which you can speak to the device and the device will execute commands, and those commands are to take a picture, to take a video, transfer this to your smartphone, transfer from your smartphone to your display on the Google Glass, textual information, like e-mails, text messages, instruction on navigation, things of that nature. It’s not designed for the heavy computing required to do real-time video processing. It looks like a different beast; it appeals to different kinds of users and different capabilities.
Cooper: What would happen if—remember the old-time movies, they weren’t talkies, they were just movies that would stop every so often and have text?
Cooper: What would happen if somebody used OrCam in a theater? If you point it at the screen, what would happen?
Shashua: We tested it on reading subtitles on a movie, and it does it quite well. You can watch a movie. Right now you have to point toward the screen, but we can activate such a function without pointing. If there’s a motion picture in front of the camera and there’s text below, the system can start reading the text, the subtitles. So it works fast enough to read the subtitles, even though that was not the design of the unit. But it’s capable of reading subtitles.
Cooper: When you say “point,” are you talking about literally a finger pointing? Or any item someone would use as a pointer?
Shashua: You are pointing with your finger, because the camera is looking forward into the world and it sees your finger. What the system does is get video from the camera and continuously look for the user’s finger. Once it finds the finger, it triggers a function of trying to do visual interpretation. It not only looks for your finger, it also looks for faces. Once it finds a familiar face, it will also—without you needing to point—tell you whose face it is. The basic trigger is pointing. You simply point with your finger. Of course, the finger should be in the field of view of the camera.
Say you’re holding a newspaper, and there’s a certain article you would like to read. You simply point at the article, and the system will do a layout of the newspaper, look at the closest column to your finger, and start reading it.
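The column-selection step described above can be illustrated with a minimal sketch. It is not OrCam's implementation: it assumes a layout step has already produced one bounding box per text column, and the `Column` type, its field names and the function names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Column:
    """Bounding box of one text column in image pixels (hypothetical layout output)."""
    name: str
    left: float
    top: float
    right: float
    bottom: float


def distance_to_box(x: float, y: float, col: Column) -> float:
    """Euclidean distance from a point to a box; 0.0 when the point lies inside it."""
    dx = max(col.left - x, 0.0, x - col.right)
    dy = max(col.top - y, 0.0, y - col.bottom)
    return (dx * dx + dy * dy) ** 0.5


def closest_column(finger_x: float, finger_y: float,
                   columns: List[Column]) -> Optional[Column]:
    """Pick the column nearest to the fingertip, or None if no text was found."""
    if not columns:
        return None
    return min(columns, key=lambda c: distance_to_box(finger_x, finger_y, c))


# Usage: two side-by-side articles, fingertip resting on the right one.
page = [
    Column("left article", 0, 0, 100, 400),
    Column("right article", 120, 0, 220, 400),
]
chosen = closest_column(130, 50, page)
```

Measuring distance to the box rather than to its center keeps a tall, narrow column from losing out to a short one whose center merely happens to be closer to the finger.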
Cooper: Does it recognize your finger? What if you have gloves on?
Shashua: It’s not personalized to your finger; it recognizes “fingers” as a generic object class. Whether it will work with gloves, I’m not sure. We haven’t tested it on gloves. The idea is that your hand is naked.
Cooper: So if you’re in a cold climate, you need to take your glove off.
Shashua: Maybe, yes.
Cooper: Well, we need to fix that right away.
Shashua: (laughs) It’s hard to imagine an elderly user out there in the snow pointing with the device.
Cooper: Everyone from Iceland is saying, “We need the glove feature!”
Cooper: From what you’re already seeing and the feedback you’re getting, is there something that you’re already planning on fixing or putting into the next batch?
Shashua: During testing, things came out that made us do some modifications. The basic functions we were correct about, and when users tested the device, we didn’t need to change anything. But there is a difference between how a person with low vision uses the system compared to someone with normal sight. If you take someone with normal sight and teach him or her how to use the system, how to point, that the finger should be in the center of the field of view, after a few minutes the person understands and will activate the system in the way that is intuitive to someone who has normal vision.
For someone with low vision, it’s less obvious. With low vision the field of view is not uniform. There are areas of the field of view where one sees better than other areas. So what could happen is that someone could look in one direction and point in another direction, even though the person was trained not to do so. What we modified in the system is to detect such maneuvers, to detect unsuccessful attempts at pointing to the scene and provide feedback to the user. We also designed training software such that when you start the device for the first time, it activates a teaching mode in which you stand in front of a white wall and you get feedback as to where your finger is with respect to the field of view. Ideally you’d like to place the finger in the center of the field of view. So there are all sorts of things we’ve developed as part of getting feedback from real users.
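As a rough illustration of the teaching mode described above, the sketch below turns a detected fingertip position into a spoken hint that steers the finger toward the center of the frame. Everything in it is an assumption made for illustration: the function name, the hint wording and the 15 percent tolerance are not taken from the device.

```python
from typing import Optional, Tuple


def pointing_feedback(finger: Optional[Tuple[float, float]],
                      frame_width: float,
                      frame_height: float,
                      tolerance: float = 0.15) -> str:
    """Return a hint that steers the fingertip toward the frame center.

    `finger` is the (x, y) pixel position of the fingertip, or None when no
    finger was detected. Image coordinates are assumed: x grows to the right
    and y grows downward. `tolerance` is a hypothetical dead zone, expressed
    as a fraction of the frame size.
    """
    if finger is None:
        return "finger not found"

    # Offset from the center, normalized to the range [-0.5, 0.5].
    off_x = finger[0] / frame_width - 0.5
    off_y = finger[1] / frame_height - 0.5

    hints = []
    if off_x < -tolerance:
        hints.append("move right")   # finger is too far left
    elif off_x > tolerance:
        hints.append("move left")    # finger is too far right
    if off_y < -tolerance:
        hints.append("move down")    # finger is too high in the frame
    elif off_y > tolerance:
        hints.append("move up")      # finger is too low in the frame

    return " and ".join(hints) if hints else "centered"


# Usage: a fingertip near the lower-right corner of a 640x480 frame.
hint = pointing_feedback((600, 450), 640, 480)
```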
Cooper: You also have the other challenge of different types of visual loss. They’re not all the same.
Shashua: They’re not all the same, but the kinds of functions the device provides are common to almost the complete spectrum of visual loss. The first thing that goes away is the ability to read. In order to read, you need very high contrast, and this is the first thing that goes away. This is why the basic layer of the device is the ability to read everywhere and anything, outdoors, indoors, newspapers, books, texts in the wild, street names on flexible surfaces, metallic surfaces, etc.
The second thing that goes away is the ability to recognize objects: recognizing faces, holding a money note and trying to figure out what value it is, especially if it’s US currency, because all of them are green and look the same. The degree changes from user to user. Some of them cannot recognize faces. Some of them have difficulty recognizing objects and money notes. So the device has the ability to recognize objects, and you can teach it new objects. In that way you can customize it to your particular type of visual needs.
Cooper: Is there some kind of artificial intelligence used?
Shashua: Yes. If you would like to teach it a new object, you hold the object and shake it in front of the camera. This shaking maneuver will trigger the system to learn this object, so the system will tell you, “Tell me the name of the object.” Say for example it’s a milk carton. You’ll say, “It’s my favorite milk carton.” The system will take an imprint, a picture of the object. The next time you point to this particular object, the system will repeat the name that you gave it.
Cooper: So it’s picking up the sound of your voice when you shake it, it’s programmed then to learn that image, recognize the voice command of that image, and so if you start dating a new person, you just shake her?
Shashua: Oh, okay. (laughs)
Shashua: It detects faces continuously. You don’t need to point. It scans the visual field and looks for faces all the time. When it sees faces, it tries to recognize them. If it doesn’t succeed in recognizing them, it doesn’t do anything. If you are standing in front of a person and you are pointing to that person, then the combination of detecting a face and detecting a finger pointing to that face triggers the system to ask you, “What is the name of this person?” You say the name of the person. The system takes the imprint of the picture. The next time the person appears in the field of view, the system will simply state the name of the person. You don’t need to point.
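The decision logic in this answer can be summarized in a short sketch. It only models the branching Shashua describes, not the recognition itself: `face_id` is a hypothetical stand-in for whatever imprint the matcher produces, and the class and method names are invented for illustration. One assumption goes beyond the interview: a face that is already known is announced whether or not the user is pointing at it.

```python
from typing import Dict, Optional


class FaceMemory:
    """Toy model of the announce / ask / stay-silent behavior described above."""

    PROMPT = "What is the name of this person?"

    def __init__(self) -> None:
        # Maps a hypothetical face imprint identifier to the name the user gave.
        self._names: Dict[str, str] = {}

    def enroll(self, face_id: str, name: str) -> None:
        """Store the name the user spoke for a newly pointed-at face."""
        self._names[face_id] = name

    def on_frame(self, face_id: Optional[str], finger_on_face: bool) -> Optional[str]:
        """Return what the device would say for one video frame, or None for silence."""
        if face_id is None:
            return None                    # no face in view
        if face_id in self._names:
            return self._names[face_id]    # familiar face: announce, no pointing needed
        if finger_on_face:
            return self.PROMPT             # unknown face plus pointing: ask for a name
        return None                        # unknown face, no pointing: do nothing


# Usage: an unknown face is ignored until the user points at it and names it.
memory = FaceMemory()
first = memory.on_frame("imprint-1", finger_on_face=False)
asked = memory.on_frame("imprint-1", finger_on_face=True)
memory.enroll("imprint-1", "Dana")
later = memory.on_frame("imprint-1", finger_on_face=False)
```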
Cooper: Is there a standard of what the person has to be doing, being face-on?
Shashua: It’s similar to the facial recognition capabilities that you find in Facebook, iPhoto and Picasa. There is a tolerance to a certain amount of variability. If you completely change your facial appearance, it will not recognize you again.
Cooper: What’s the memory? Is it capturing—
Shashua: It’s simply capturing the image. When you’re learning a new face, when you’re pointing to a face that you’re looking at and the system asks you for the name of the person, the system memorizes that particular image and it has a memory which is practically unlimited. There’s no real limit to the number of faces or objects that the system can recognize.
Cooper: That’s basically the only data it’s really archiving?
Cooper: What if you wanted to record and archive? Does it have that capability?
Shashua: It only records the new objects that you’d like it to learn and new faces that you’d like it to learn. That’s the only archive.
Cooper: Anything you shake or something that it recognizes as a face.
Cooper: I could see that might be one of the premium offers you might do in the future. If someone wanted to actually record something, they would have the capability to download.
Shashua: It is possible, but again, the user profile is people who are technology-averse. You don’t want to sell a product requiring a connection to a laptop or a computer, downloading, uploading. You limit it to a very small user base.
Cooper: I was just picturing somebody who might be going into small claims court, for example, and they want to record what’s happening, so they just use that system.
Shashua: The system doesn’t have a display, so it will record, but it would be a problem to display it, unless you connect it to a computer.
Cooper: That’s what I was picturing. You’d have a Wi-Fi, you’d download to a laptop, and you’d have that as a reference in the future.
Shashua: The device has a Bluetooth chip which at the moment is not being activated, but in future releases,