Hand Tracking

Summary

hand_tracking and hand_tracking_gpu are plugins which detect hands in an image. The plugins integrate the MediaPipe hand landmark detection algorithm into the ILLIXR framework. The only difference between hand_tracking and hand_tracking_gpu is where the data are processed: CPU vs. GPU. Their operation and interface are identical. (Currently the GPU version is a work in progress.)

Switchboard connection

The hand_tracking plugin subscribes to the webcam to get the input images to process. Future development will allow this plugin to subscribe to other input types dynamically, depending on the user's needs. The plugin utilizes the following data structures (a code sketch follows the note below):

  • rect: representation of a rectangle
    • x_center: x-coordinate of the rectangle center
    • y_center: y-coordinate of the rectangle center
    • width: width of the rectangle
    • height: height of the rectangle
    • rotation: rotation angle of the rectangle in radians
    • normalized: boolean indicating the units; true indicates normalized units [0..1] of the input image, false indicates pixel units
    • valid: boolean indicating whether the object is valid
  • point: representation of a 3D point
    • x: x-coordinate
    • y: y-coordinate
    • z: z-coordinate (not an absolute distance, but a measure of the point's depth relative to other points)
    • normalized: boolean indicating the units; true indicates normalized units [0..1] of the input image, false indicates pixel units
    • valid: boolean indicating whether the object is valid

Note

All coordinates in these data are normalized to the input image size
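Conceptually, these map onto simple aggregates. The sketch below is illustrative only; the field names come from the list above, while the concrete types are assumptions and may differ from the actual ILLIXR headers.

    // Illustrative sketch of the hand_tracking data structures.
    // Field names follow the documentation above; exact types are assumptions.
    struct rect {
        float x_center;   // x-coordinate of the rectangle center
        float y_center;   // y-coordinate of the rectangle center
        float width;      // width of the rectangle
        float height;     // height of the rectangle
        float rotation;   // rotation angle in radians
        bool  normalized; // true: normalized [0..1] units; false: pixel units
        bool  valid;      // whether the object is valid
    };

    struct point {
        float x;          // x-coordinate
        float y;          // y-coordinate
        float z;          // relative depth, not an absolute distance
        bool  normalized; // true: normalized [0..1] units; false: pixel units
        bool  valid;      // whether the object is valid
    };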

The plugin publishes an ht_frame which contains the following data (a consumption sketch follows the info note below):

  • detections: the raw information produced by the MediaPipe code; x and y coordinates are normalized to the input image size, and z has no meaning
    • left_palm: rect which encloses the left palm, if detected
    • right_palm: rect which encloses the right palm, if detected
    • left_hand: rect which encloses the entire left hand, if detected
    • right_hand: rect which encloses the entire right hand, if detected
    • left_confidence: float indicating the detection confidence of the left hand [0..1]
    • right_confidence: float indicating the detection confidence of the right hand [0..1]
    • left_hand_points: vector of 21 point objects, one per hand landmark, for the left hand, if detected
    • right_hand_points: vector of 21 point objects, one per hand landmark, for the right hand, if detected
    • img: cv::Mat in CV_8UC4 format (RGBA), representing the detection results
  • hand_positions: map of detected points for each hand; if depth cannot be determined, the value for the depth axis will have no meaning (which axis depends on the coordinate reference frame); the coordinate origin is defined by the user at startup
  • hand_velocities: map of velocities for each detected point of each hand; requires that depth is known or calculated and that the previous iteration produced valid results; the units are the coordinate unit per second
  • offset_pose: a pose that, when removed from each point, yields coordinates relative to the camera
  • reference: the coordinate reference space (e.g. left-handed, y up)
  • unit: the units of the coordinate system

Info

The detections may be removed or re-worked in future releases
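As a minimal sketch, a downstream plugin might read the published ht_frame through the ILLIXR switchboard. The topic name "ht", the header paths, and the exact reader calls below are assumptions for illustration; consult the plugin source for the actual topic and types.

    // Minimal consumption sketch; topic name "ht" and headers are assumptions.
    #include "illixr/switchboard.hpp"        // hypothetical header path
    #include "illixr/hand_tracking_data.hpp" // hypothetical header declaring ht_frame

    void poll_hands(ILLIXR::switchboard& sb) {
        // Hypothetical topic name; check the plugin source for the real one.
        auto reader = sb.get_reader<ht_frame>("ht");
        auto frame  = reader.get_ro_nullable();
        if (frame == nullptr) {
            return; // nothing published yet
        }
        if (frame->detections.left_confidence > 0.5f) {
            // The left hand was detected with reasonable confidence.
        }
    }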

Each vector of hand points contains 21 items which reference the following (from https://ai.google.dev/edge/mediapipe/solutions/vision/hand_landmarker):

[hand landmark map image]

The landmark_points enum can be used to reference the individual points in each vector. For example,

ht.left_hand_points[THUMB_TIP]

will get the point for the tip of the left thumb.
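As a further illustration, the landmarks can drive simple gesture checks. The sketch below assumes landmark_points mirrors the standard MediaPipe names (THUMB_TIP, INDEX_FINGER_TIP) and uses the point fields documented above.

    // Sketch: crude pinch test on the left hand, assuming MediaPipe-style
    // enum names and the point structure documented above.
    #include <cmath>

    bool is_pinching(const ht_frame& ht, float threshold = 0.05f) {
        const point& thumb = ht.left_hand_points[THUMB_TIP];
        const point& index = ht.left_hand_points[INDEX_FINGER_TIP];
        if (!thumb.valid || !index.valid) {
            return false;
        }
        // Distance in the (normalized) image plane; z is only a relative depth.
        const float dx = thumb.x - index.x;
        const float dy = thumb.y - index.y;
        return std::sqrt(dx * dx + dy * dy) < threshold;
    }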

Environment Variables

The hand tracking plugin utilizes the following environment/YAML file variables to control its processing (an example configuration follows the list):

  • HT_INPUT: the type of images to be fed to the plugin. Values are
    • zed
    • cam (for typical cam_type/binocular images)
    • webcam (single image)
  • HT_INPUT_TYPE: descriptor of what image(s) to use. Values are
    • LEFT - only use the left eye image from an input pair
    • SINGLE - same as LEFT
    • RIGHT - only use the right eye image from an input pair
    • MULTI - use both input images
    • BOTH - same as MULTI
    • RGB - only a single input image
  • WCF_ORIGIN: the origin pose of the world coordinate system as a string of three, four, or seven numbers. The numbers should be comma-separated with no spaces.
    • x,y,z - three coordinate version, representing the position of the origin pose (quaternion will be 1,0,0,0)
    • w,wx,wy,wz - four coordinate version, representing the quaternion of the origin pose (position will be 0,0,0)
    • x,y,z,w,wx,wy,wz - seven coordinate version, representing the full origin pose
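
For example, a setup that feeds single webcam RGB frames and shifts the world origin could be configured as follows (values are illustrative):

    # Illustrative values only.
    export HT_INPUT=webcam
    export HT_INPUT_TYPE=RGB
    export WCF_ORIGIN=0,1.5,0   # three-number form; quaternion defaults to 1,0,0,0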

Helper plugins

There are two additional plugins which are designed to aid in debugging the hand_tracking plugin.

Viewer

The hand_tracking.viewer plugin subscribes to the output of the hand_tracking plugin and displays the results, both graphically and in tabular format.

Webcam

The webcam plugin can feed single frame images to the hand tracking plugin.

OpenXR

The hand tracking plugin can be built with an OpenXR interface. To build the interface, add -DBUILD_OXR_INTERFACE=ON to your cmake command line. The interface itself is in libopenxr_illixr_ht.so and is designed to be an API Layer. It installs a JSON file in the user's .local directory which is automatically detected by libopenxr_loader.so. To use the layer you will need both an OpenXR application and a runtime. This code is known to be compatible with the Monado runtime, and should be compatible with others. Currently, the hand tracking must receive its data from ILLIXR, but as an API Layer the resulting calculations can be retrieved via OpenXR API calls.
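For example, the configure step might look like this (build directory layout assumed):

    # Configure ILLIXR with the OpenXR hand-tracking layer enabled.
    cmake -DBUILD_OXR_INTERFACE=ON ..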

API

The hand tracking API can be found here.