About This Research
This repository contains research on developing proactive robotic assistants capable of anticipating human needs and actions in complex, semi-structured environments. The project focuses on Human-Centric Digital Twins (HCDTs) that can predict human intent and generate corresponding 3D motion for enhanced human-robot collaboration.
Key Research Areas:
- Intent Prediction: Using AI foundation models to understand human intentions from multimodal cues
- Motion Generation: Creating physically plausible 3D human motion sequences based on predicted intent
- Context Awareness: Leveraging scene understanding and task knowledge for better predictions
- Modular Framework: Integrating pre-trained models for perception, reasoning, and motion synthesis
Technical Approach:
Our modular framework combines state-of-the-art components including:
- Vision-Language Models (VLMs) for scene understanding
- Large Language Models (LLMs) for high-level reasoning about tasks and intent
- Diffusion models for physics-aware human motion generation
- 3D pose estimation using the SMPL-X body representation (see the brief sketch after this list)
- Gaze tracking and attention analysis for intent cues
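As a concrete illustration of the pose representation, the sketch below instantiates an SMPL-X body model with the smplx Python package and reads out 3D joint positions; the model path and parameter values are placeholders, not the configuration used in this repository.

# Illustrative only: load an SMPL-X body model and query 3D joints
# (the "models/" path is a placeholder and assumes SMPL-X model files are present)
import torch
import smplx

model = smplx.create("models/", model_type="smplx", gender="neutral", use_pca=False)
betas = torch.zeros(1, 10)        # body shape coefficients
body_pose = torch.zeros(1, 63)    # 21 body joints x 3 axis-angle parameters
output = model(betas=betas, body_pose=body_pose, return_verts=True)
print(output.joints.shape)        # 3D joint positions usable for downstream motion analysis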
The system is designed to be particularly beneficial for Small and Medium Enterprises (SMEs) seeking adaptable human-robot collaboration solutions without requiring extensive end-to-end model retraining.
Getting Started
Quick Setup
To set up the HCDT framework on your system:
conda create -n hcdt python=3.10 -y
conda activate hcdt
# Install PyTorch with CUDA
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# Install dependencies
pip install -r requirements.txt
pip install pandas numpy matplotlib opencv-python pillow tqdm
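After installation, a quick check along these lines confirms that PyTorch was built with CUDA support (assumes the hcdt environment is active):

# Sanity check: verify that PyTorch can see the GPU before running experiments
import torch

print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))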
Running Experiments
1. Comprehensive Model Evaluation
Execute experiments across multiple AI models and configurations:
./run_all_models.sh
Features:
- Tests Gemini (2.5-flash-lite, 2.5-flash, 2.5-pro) and Gemma (3-27b-it) models
- Runs Single, ICL, RCWPS, and Phase2 experiment types
- Evaluates Cooking, HAViD assembly, and Stack manipulation tasks
- Varies gaze usage, ground truth, and camera viewpoints
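The exact flags are defined in run_all_models.sh itself; the snippet below is only a sketch of the configuration grid the script covers, with a hypothetical run_experiment helper standing in for the actual per-run command.

# Sketch of the experiment grid swept by run_all_models.sh; the runner function
# is hypothetical, while the model/experiment/task lists mirror this README
from itertools import product

MODELS = ["gemini-2.5-flash-lite", "gemini-2.5-flash", "gemini-2.5-pro", "gemma-3-27b-it"]
EXPERIMENTS = ["Single", "ICL", "RCWPS", "Phase2"]
TASKS = ["Cooking", "HAViD", "Stack"]

def run_experiment(model, experiment, task, use_gaze):
    # Placeholder for the repository's actual experiment entry point
    print(f"model={model} experiment={experiment} task={task} gaze={use_gaze}")

for model, experiment, task, use_gaze in product(MODELS, EXPERIMENTS, TASKS, [True, False]):
    run_experiment(model, experiment, task, use_gaze)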
2. Results Analysis and Visualization
Generate comprehensive evaluation tables and visualizations:
python generate_results_table.py
# Process Phase 2 hand position predictions
python eval/batch_process_phase2.py
Outputs:
- CSV and LaTeX formatted results tables
- Temporal F1 score analysis
- Hand position prediction accuracy (NMPE metrics; see the sketch after this list)
- Comparative visualizations across models
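For reference, below is a minimal sketch of a normalized mean position error of the kind NMPE reports, assuming NMPE denotes per-sample Euclidean error averaged and divided by a normalization scale; the exact definition used by eval/batch_process_phase2.py may differ.

# Minimal NMPE-style sketch for hand position predictions; the normalization
# factor and array shapes are assumptions, not the repository's exact metric
import numpy as np

def normalized_mean_position_error(pred, gt, scale):
    """pred, gt: (N, 2) or (N, 3) arrays of hand positions; scale: normalization length."""
    errors = np.linalg.norm(pred - gt, axis=-1)   # per-sample Euclidean error
    return float(errors.mean() / scale)           # normalize, e.g. by image diagonal

pred = np.array([[0.52, 0.41], [0.30, 0.75]])
gt = np.array([[0.50, 0.40], [0.33, 0.70]])
print(normalized_mean_position_error(pred, gt, scale=1.0))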
3. Data Processing Pipeline
Set up new experiments with automated preprocessing.
The pipeline includes:
- Video frame extraction using ffmpeg (see the example after this list)
- 3D pose estimation with GVHMR
- Pose format conversion via SMPLest-X
- Gaze target prediction using Gazelle
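As an example of the first step, frames can be extracted with a call such as the one below; the input path, output pattern, and frame rate are placeholders, and the repository's own preprocessing scripts may use different settings.

# Example frame extraction with ffmpeg (paths and fps are placeholders)
import subprocess
from pathlib import Path

Path("frames").mkdir(exist_ok=True)
subprocess.run(
    ["ffmpeg", "-i", "input.mp4", "-vf", "fps=30", "frames/%06d.png"],
    check=True,
)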
Requirements
- Hardware: NVIDIA GPU with CUDA support, 16GB+ RAM recommended
- Software: Python 3.10, PyTorch, OpenCV, ffmpeg
- Special Environments: Conda environments for gvhmr, smplestx, gazelle
- APIs: Google AI API access for Gemini models
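For the Gemini models, authentication typically goes through a Google AI client; the snippet below shows one common pattern, with the environment variable name and the google-generativeai client library given as assumptions rather than this repository's exact setup.

# One common way to authenticate against the Gemini API (environment variable
# name and client library are assumptions, not this repository's exact setup)
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash")
print(model.generate_content("Describe the next likely human action.").text)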
Project Results
Visualization of our prediction results and framework performance:

Demo Videos
Watch our project demonstration videos: