This project works towards a navigational aid system that can guide visually impaired people. The main challenge was to design a system that performs object detection and recognition and calculates object distance at the same time. Microsoft Kinect suits this purpose because it provides both RGB and depth images through its color camera and infrared depth sensor. With the Kinect as the core input device and the OpenCV Haar classifier for detecting and recognizing objects, we have built a prototype application that is trained to recognize objects and human faces in real time and that informs the user of each object's location, in terms of direction and footsteps, through a speech synthesizer. Work is ongoing to improve performance in uncontrolled environments and to add more objects to recognize. A brief system description follows.
Microsoft Kinect is the sole input sensor of our system. Kinect is an electronic device that integrates an RGB camera, a depth camera and other motion-sensing capabilities. It builds on software technology developed internally by Rare, a subsidiary of Microsoft Game Studios, and on range camera technology by the Israeli developer PrimeSense. PrimeSense developed a system that can interpret specific gestures, making completely hands-free control of electronic devices possible, by using an infrared projector and camera and a special microchip to track the movement of objects and individuals in three dimensions. This 3D scanner system, called Light Coding, employs a variant of image-based 3D reconstruction. The Kinect sensor is a horizontal bar connected to a small base with a motorized pivot and is designed to be positioned lengthwise above or below the video display. The device features an RGB camera, a depth sensor and a multi-array microphone running proprietary software, which together provide full-body 3D motion capture, facial recognition and voice recognition capabilities.
The driver interface is the layer that communicates with the sensor through the device driver, so the interface depends on the driver that is used. Several device drivers are available for communicating with the Kinect. The one we used to build our system is OpenNI. Its interface can be used from several languages, which in our case is Python 2.7. We imported the OpenNI packages in Python and communicate with the Kinect to acquire and record the images it captures.
The OpenCV Haar classifier is used for detection and recognition of objects in images. First, several high-definition videos were collected for the objects of interest. The videos were then split into static images, and a tool was developed to mark the objects in those images so that the coordinates of the region bounding each object can be written to a separate descriptor file. Finally, the descriptor file is used to extract the positive sample images into a vector file, which is passed to the OpenCV haartraining process together with negative background images to build the classifier. In this way, classifier files for different objects were created and later used to recognize objects in the images delivered by the Kinect.
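The descriptor file mentioned above can be illustrated with a short sketch. It assumes the plain-text layout that OpenCV's sample-creation tool reads, one line per image in the form `path count x y w h [x y w h ...]`. The function names and the example file paths are hypothetical and are not taken from the project's marking tool.

```python
def descriptor_line(image_path, boxes):
    """Format one descriptor entry: image path, number of marked objects,
    then x, y, width, height for each bounding box (in pixels)."""
    parts = [image_path, str(len(boxes))]
    for x, y, w, h in boxes:
        parts.extend(str(int(v)) for v in (x, y, w, h))
    return " ".join(parts)


def descriptor_text(annotations):
    """Build the full descriptor file content from (path, boxes) pairs.
    Images without any marked object are skipped."""
    lines = [descriptor_line(path, boxes) for path, boxes in annotations if boxes]
    return "\n".join(lines) + "\n"


# Hypothetical annotations produced by a marking tool.
annotations = [
    ("frames/chair_001.jpg", [(140, 100, 45, 45)]),
    ("frames/chair_002.jpg", [(100, 200, 50, 50), (50, 30, 25, 25)]),
    ("frames/empty_003.jpg", []),
]
print(descriptor_text(annotations), end="")
```

Each line lists how many objects were marked in the image, so one frame can contribute several positive samples.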
Object Location Calculation
Every detected object in the system is annotated with a tuple of two values: the approximate distance between the object and the sensor, and the angle the object makes with the z-axis of the sensor. These values are calculated from the positional information provided by the object detector and the depth data obtained from the Kinect. The challenge is to obtain the correct distance of the located object, because we cannot rely only on the depth value at the exact pixel coordinate of the detection. The Kinect measures distance with its infrared camera and reports, for every pixel of the 2D image, a distance in a specified unit. For some pixels, however, it fails to extract a distance because of environmental conditions and reports a built-in value of zero instead. To extract the distance reliably, we therefore follow an algorithm that is not affected by such invalid measurements. The main idea of the algorithm is to take a dynamically sized region of pixels around the detected object and compute the median of the distances in that region.
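The median-based idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's implementation: the depth image is a list of rows of integer distances with zero meaning "no reading", the window is a square that grows around the centre of the detection until it holds enough valid readings, and the angle uses a pinhole model with an assumed horizontal field of view of about 57 degrees. All function names and default parameters are hypothetical.

```python
import math
from statistics import median


def median_depth(depth, cx, cy, start_radius=2, max_radius=20, min_valid=5):
    """Median of the non-zero depth readings in a square window centred on
    (cx, cy). The window grows until it holds at least min_valid readings;
    returns None if that never happens within max_radius."""
    height, width = len(depth), len(depth[0])
    for radius in range(start_radius, max_radius + 1):
        valid = [
            depth[y][x]
            for y in range(max(0, cy - radius), min(height, cy + radius + 1))
            for x in range(max(0, cx - radius), min(width, cx + radius + 1))
            if depth[y][x] > 0  # zero marks an invalid measurement
        ]
        if len(valid) >= min_valid:
            return median(valid)
    return None


def horizontal_angle(cx, width, hfov_deg=57.0):
    """Angle in degrees between the sensor's z-axis and pixel column cx.
    Negative values are left of centre. hfov_deg is an assumed value."""
    focal_px = (width / 2.0) / math.tan(math.radians(hfov_deg / 2.0))
    return math.degrees(math.atan((cx - width / 2.0) / focal_px))


def locate(depth, box):
    """Return (distance, angle) for a detection box given as (x, y, w, h)."""
    x, y, w, h = box
    cx, cy = x + w // 2, y + h // 2
    return median_depth(depth, cx, cy), horizontal_angle(cx, len(depth[0]))


# Small synthetic depth image: 1500 everywhere, two failed readings at the centre.
depth = [[1500] * 10 for _ in range(10)]
depth[5][5] = 0
depth[5][4] = 0
distance, angle = locate(depth, (3, 3, 4, 4))
```

The median is a reasonable choice here because a few background or foreground pixels inside the window shift it far less than they would shift a mean.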
This part of the system consists of a speech synthesizer. The objective is to inform the visually impaired user, in a friendly way, about the location of the objects in their surroundings. The results of object recognition and location calculation are therefore formulated as structured text and passed to the speech synthesizer, which delivers voice output based on that text.
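As an illustration of how a recognition result might be turned into structured text, the sketch below converts a distance and an angle into a direction and a number of footsteps. The step length of 700 mm, the 10 degree "straight ahead" band, the sentence wording and the function name are all assumptions made for this example. The resulting string would then be handed to whichever speech synthesizer is in use.

```python
def describe(label, distance_mm, angle_deg, step_mm=700.0, ahead_deg=10.0):
    """Build the sentence to be spoken for one detected object.
    step_mm and ahead_deg are assumed values, not measured ones."""
    # Convert the distance into whole footsteps, never reporting fewer than one.
    steps = max(1, int(round(distance_mm / step_mm)))
    if angle_deg < -ahead_deg:
        direction = "to your left"
    elif angle_deg > ahead_deg:
        direction = "to your right"
    else:
        direction = "straight ahead"
    unit = "footstep" if steps == 1 else "footsteps"
    return "%s %s, about %d %s away" % (label, direction, steps, unit)


message = describe("Chair", 2100, -20)
print(message)
```

Reporting distance in footsteps rather than metric units keeps the spoken message short and directly actionable for a walking user.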