To recognize gestures, you need to capture images from a video camera and do an image recognition algorithm on it. Image recognition is easy if you know what object you are looking for. It basically involves colors. For example, assume you are looking for a green box in a black background. In your algorithm, you just need to keep processing the image (RGB values) and keep looking for the color green. Now there are two steps to this process:
1) image recognition: here you look for object until it appears in your frame
2) finding the position and size of the object: once you know that you have found your object, you begin to find its dimensions. Then, using basic trigonometry, you find its position