About This Research
This repository contains research on developing proactive robotic assistants capable of anticipating human needs and actions in complex, semi-structured environments. The project focuses on Human-Centric Digital Twins (HCDTs) that can predict human intent and generate corresponding 3D motion for enhanced human-robot collaboration.
Key Research Areas:
- Intent Prediction: Using AI foundation models to understand human intentions from multimodal cues
- Motion Generation: Creating physically plausible 3D human motion sequences based on predicted intent
- Context Awareness: Leveraging scene understanding and task knowledge for better predictions
- Modular Framework: Integrating pre-trained models for perception, reasoning, and motion synthesis
Technical Approach:
Our modular framework combines state-of-the-art components including:
- Vision-Language Models (VLMs) for scene understanding
- Large Language Models (LLMs) for high-level reasoning about tasks and intent
- Diffusion models for physics-aware human motion generation
- 3D pose estimation using the SMPL-X body representation (see the brief sketch after this list)
- Gaze tracking and attention analysis for intent cues
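As a concrete illustration of the pose representation, the sketch below instantiates an SMPL-X body model with the smplx Python package and reads out 3D joint positions; the model path and parameter values are placeholders, not the configuration used in this repository.

# Illustrative only: load an SMPL-X body model and query 3D joints
# (the "models/" path is a placeholder and assumes SMPL-X model files are present)
import torch
import smplx

model = smplx.create("models/", model_type="smplx", gender="neutral", use_pca=False)
betas = torch.zeros(1, 10)        # body shape coefficients
body_pose = torch.zeros(1, 63)    # 21 body joints x 3 axis-angle parameters
output = model(betas=betas, body_pose=body_pose, return_verts=True)
print(output.joints.shape)        # 3D joint positions usable for downstream motion analysis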
The system is designed to be particularly beneficial for Small and Medium Enterprises (SMEs) seeking adaptable human-robot collaboration solutions without requiring extensive end-to-end model retraining.
Getting Started
Quick Setup
To set up the HCDT framework on your system:
conda create -n hcdt python=3.10 -y
conda activate hcdt
# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# Install dependencies
pip install -r requirements.txt
pip install pandas numpy matplotlib opencv-python pillow tqdm
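After installation, a quick check along these lines confirms that PyTorch was built with CUDA support (assumes the hcdt environment is active):

# Sanity check: verify that PyTorch can see the GPU before running experiments
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))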
Running Experiments
1. Comprehensive Model Evaluation
Execute experiments across multiple AI models and configurations:
./run_all_models.sh
Features:
- Tests Gemini (2.5-flash-lite, 2.5-flash, 2.5-pro) and Gemma (3-27b-it) models
- Runs Single, ICL, RCWPS, and Phase2 experiment types
- Evaluates Cooking, HAViD assembly, and Stack manipulation tasks
- Varies gaze usage, ground truth, and camera viewpoints
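The exact flags are defined in run_all_models.sh itself; the snippet below is only a sketch of the configuration grid the script covers, with a hypothetical run_experiment helper standing in for the actual per-run command.

# Sketch of the experiment grid swept by run_all_models.sh; the runner function
# is hypothetical, while the model/experiment/task lists mirror this README
from itertools import product

MODELS = ["gemini-2.5-flash-lite", "gemini-2.5-flash", "gemini-2.5-pro", "gemma-3-27b-it"]
EXPERIMENTS = ["Single", "ICL", "RCWPS", "Phase2"]
TASKS = ["Cooking", "HAViD", "Stack"]

def run_experiment(model, experiment, task, use_gaze):
    # Placeholder for the repository's actual experiment entry point
    print(f"model={model} experiment={experiment} task={task} gaze={use_gaze}")

for model, experiment, task, use_gaze in product(MODELS, EXPERIMENTS, TASKS, [True, False]):
    run_experiment(model, experiment, task, use_gaze)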
2. Results Analysis and Visualization
Generate comprehensive evaluation tables and visualizations:
python generate_results_table.py
# Process Phase 2 hand position predictions
python eval/batch_process_phase2.py
Outputs:
- CSV and LaTeX formatted results tables
- Temporal F1 score analysis
- Hand position prediction accuracy (NMPE metrics; see the sketch after this list)
- Comparative visualizations across models
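For reference, below is a minimal sketch of a normalized mean position error of the kind NMPE reports, assuming NMPE denotes per-sample Euclidean error averaged and divided by a normalization scale; the exact definition used by eval/batch_process_phase2.py may differ.

# Minimal NMPE-style sketch for hand position predictions; the normalization
# factor and array shapes are assumptions, not the repository's exact metric
import numpy as np

def normalized_mean_position_error(pred, gt, scale):
    """pred, gt: (N, 2) or (N, 3) arrays of hand positions; scale: normalization length."""
    errors = np.linalg.norm(pred - gt, axis=-1)   # per-sample Euclidean error
    return float(errors.mean() / scale)           # normalize, e.g. by image diagonal

pred = np.array([[0.52, 0.41], [0.30, 0.75]])
gt = np.array([[0.50, 0.40], [0.33, 0.70]])
print(normalized_mean_position_error(pred, gt, scale=1.0))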
3. Data Processing Pipeline
Set up new experiments with automated preprocessing.
The pipeline includes:
- Video frame extraction using ffmpeg (see the example after this list)
- 3D pose estimation with GVHMR
- Pose format conversion via SMPLest-X
- Gaze target prediction using Gazelle
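As an example of the first step, frames can be extracted with a call such as the one below; the input path, output pattern, and frame rate are placeholders, and the repository's own preprocessing scripts may use different settings.

# Example frame extraction with ffmpeg (paths and fps are placeholders)
import subprocess
from pathlib import Path

Path("frames").mkdir(exist_ok=True)
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-vf", "fps=30", "frames/%06d.png"],
    check=True,
)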
Requirements
- Hardware: NVIDIA GPU with CUDA support, 16GB+ RAM recommended
- Software: Python 3.10, PyTorch, OpenCV, ffmpeg
- Special Environments: Conda environments for gvhmr, smplestx, gazelle
- APIs: Google AI API access for Gemini models
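For the Gemini models, authentication typically goes through a Google AI client; the snippet below shows one common pattern, with the environment variable name and the google-generativeai client library given as assumptions rather than this repository's exact setup.

# One common way to authenticate against the Gemini API (environment variable
# name and client library are assumptions, not this repository's exact setup)
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash")
print(model.generate_content("Describe the next likely human action.").text)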
Project Results
Visualization of our prediction results and framework performance:

Demo Videos
Watch our project demonstration videos: