This book deals with creating the algorithmic backbone that enables a computer to perceive humans in a monitored space by processing the signals that humans themselves use, namely audio and video. To do so, computers employ sensors and algorithms to detect and track multiple interacting humans by way of their faces, hands, and voices. This application domain is challenging, because both background and foreground objects clutter the audio and visual signals. After establishing particle filtering as the framework for tracking, audio-only, visual-only, and audiovisual tracking are explained in turn. Each method is analyzed, starting with the sensor configurations, continuing with detection for tracker initialization, and concluding with the trackers themselves. Techniques for fusing the modalities are then considered. Rather than offering a monolithic treatment of the tracking problem, the book also focuses on implementation, providing MATLAB code for every presented component; the reader can thus connect each concept with a corresponding piece of code that follows immediately after the theory. Finally, applications of the various tracking systems in different domains are considered.
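To give a flavor of the particle-filtering framework on which the book's trackers are built, the following is a minimal bootstrap particle filter for a toy 1-D tracking problem. This is a hypothetical illustration written here in Python for brevity (the book's own components are provided in MATLAB); the model, parameter names, and noise levels are assumptions, not the book's implementation.

```python
import random
import math

def bootstrap_particle_filter(observations, n_particles=500,
                              process_std=1.0, obs_std=1.0, seed=0):
    """Toy bootstrap particle filter: a 1-D target observed in
    Gaussian noise (illustrative model, not the book's code)."""
    rng = random.Random(seed)
    # Initialize particles around the first observation.
    particles = [observations[0] + rng.gauss(0.0, obs_std)
                 for _ in range(n_particles)]
    estimates = []
    for z in observations:
        # 1. Predict: propagate each particle through a random-walk
        #    motion model.
        particles = [p + rng.gauss(0.0, process_std) for p in particles]
        # 2. Update: weight each particle by the Gaussian likelihood
        #    of the current observation z.
        weights = [math.exp(-0.5 * ((z - p) / obs_std) ** 2)
                   for p in particles]
        total = sum(weights) or 1.0  # guard against all-zero weights
        weights = [w / total for w in weights]
        # 3. Estimate: posterior mean of the weighted particle set.
        estimates.append(sum(w * p for w, p in zip(weights, particles)))
        # 4. Resample: draw a new, equally weighted particle set in
        #    proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates
```

The predict/update/resample cycle sketched here is the same skeleton that the book's audio, visual, and audiovisual trackers instantiate with their respective motion models and likelihoods.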