
The Cognitive Brain: Task Planning with LLMs

From Voice to Action Plans

The Challenge: A voice command like "Clean the kitchen" is abstract. How does the robot know the exact sequence of actions?

Human interpretation:

  1. Navigate to kitchen
  2. Look for trash/clutter
  3. Identify objects to clean
  4. Pick up each object
  5. Place in trash/recycling/proper location
  6. Wipe surfaces
  7. Return to standby

Robot needs: Explicit, executable primitives like navigate(x, y), grasp(object_id), place(location)
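The gap between the two levels can be made concrete. Below is a minimal sketch of such primitives as Python stubs; the names and signatures mirror the skill library used later in this chapter, and the bodies are placeholders rather than real robot code.

```python
# Hypothetical robot primitives. Signatures mirror the skill library
# used later in this chapter; the bodies are print stubs.

def navigate(x: float, y: float, theta: float = 0.0) -> bool:
    """Drive the base to pose (x, y, theta); return True on arrival."""
    print(f"navigate -> ({x}, {y}, {theta})")
    return True

def grasp(object_id: int) -> bool:
    """Close the gripper on a perceived object."""
    print(f"grasp -> object {object_id}")
    return True

def place(location: str) -> bool:
    """Release the held object at a named location."""
    print(f"place -> {location}")
    return True

# A plan is then just an ordered list of primitive calls:
plan = [
    (navigate, {"x": 5.0, "y": 3.0}),
    (grasp, {"object_id": 42}),
    (place, {"location": "trash_bin"}),
]
ok = all(fn(**kwargs) for fn, kwargs in plan)
```

The planner's whole job is to produce that ordered list from a free-form sentence.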

Solution: Use an LLM (Large Language Model) as a cognitive planner that converts high-level commands into low-level action sequences.

Real-World Implementations
  • Google SayCan (2022): Uses PaLM LLM to plan household tasks for mobile manipulators
  • Microsoft ChatGPT + Robot (2023): GPT-4 generates Python code for robot control
  • Tesla Optimus Brain (2024): Custom LLM for humanoid task decomposition
  • Physical Intelligence π0 (2024): Open-source VLA model with LLM planner

The LLM Planning Architecture

graph TD
A[Voice Command: Clean the kitchen] --> B[Whisper STT]
B --> C[LLM Task Planner: GPT-4]
C --> D[Task Decomposition]
D --> E1[Step 1: navigate x=5 y=3]
D --> E2[Step 2: perceive objects=trash]
D --> E3[Step 3: grasp object_id=bottle]
D --> E4[Step 4: navigate x=6 y=2]
D --> E5[Step 5: place location=trash_bin]

E1 --> F[ROS 2 Action Server]
E2 --> F
E3 --> F
E4 --> F
E5 --> F

F --> G[Robot Execution]
G --> H{Success?}
H -->|Yes| I[Next Step]
H -->|No| J[Re-plan with LLM]
J --> C
I --> F

style C fill:#FF006B,stroke:#FF0080,stroke-width:2px,color:#fff
style F fill:#8B5CF6,stroke:#A78BFA,stroke-width:2px,color:#fff
style G fill:#00FFD4,stroke:#00F0FF,stroke-width:2px,color:#000

Key Components:

  1. LLM Planner (GPT-4, Llama 3.1): Converts abstract goal → concrete steps
  2. Skill Library: Pre-defined robot primitives (navigate, grasp, place, search)
  3. State Monitor: Tracks execution progress and failures
  4. Re-planning Loop: If action fails, ask LLM for alternative plan
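Components 3 and 4 combine into an execute-monitor-replan loop. A minimal sketch follows; `execute_step` and `replan` are stand-ins for the ROS 2 action client and the LLM call, and the failure behavior is faked for illustration.

```python
# Sketch of the execute/monitor/re-plan loop. `execute_step` and
# `replan` are placeholders for the ROS 2 action client and LLM call.

MAX_REPLANS = 2

def execute_step(step):
    # Stub: pretend every action except 'grasp broken_cup' succeeds
    return step != "grasp broken_cup"

def replan(goal, failed_step):
    # Stub: the real version re-prompts the LLM with the failure context
    return ["search trash", "grasp bottle", "place trash_bin"]

def run(goal, plan):
    replans = 0
    i = 0
    while i < len(plan):
        if execute_step(plan[i]):
            i += 1                        # State monitor: step done, advance
        elif replans < MAX_REPLANS:
            plan = replan(goal, plan[i])  # Ask LLM for an alternative plan
            replans += 1
            i = 0
        else:
            return False                  # Give up after repeated failures
    return True

success = run("clean kitchen", ["navigate kitchen", "grasp broken_cup"])
```

In this toy run the grasp fails once, the loop re-plans, and the replacement plan completes, so `run` returns True.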

The Latency Trap: Cloud vs Edge

Problem: Network Latency Kills Real-Time Robotics

Cloud-based LLM (GPT-4 via API):

User command → 50ms network → 200ms GPT-4 inference → 50ms network → Robot
Total: 300ms (acceptable for planning, NOT for low-level control)

Edge-based LLM (Llama 3.1 8B on Jetson):

User command → 0ms network → 500ms local inference → Robot
Total: 500ms (acceptable for planning)

BUT: Low-level control must run on-device at 100-1000 Hz (1-10ms latency).

Critical Design Rule

Cloud/Edge for Planning (1-10 Hz): Task decomposition, re-planning
Edge-only for Control (100-1000 Hz): Joint PD control, balance, collision avoidance

Never send low-level control commands over network! A 100ms network spike means the robot falls over.
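The rule above reduces to simple arithmetic: a loop running at f Hz has a period budget of 1000/f milliseconds, and the end-to-end latency of the compute path must fit inside it. A small sketch, using the illustrative latency numbers from this section:

```python
# Where may a loop run? The total latency of a compute path must fit
# within the loop period (1000 / rate_hz milliseconds).
# Latency figures are the illustrative ones from this section.

PATHS_MS = {
    "cloud_llm": 50 + 200 + 50,  # network + GPT-4 inference + network = 300 ms
    "edge_llm": 500,             # local Llama 3.1 inference
    "onboard_control": 1,        # on-device control code
}

def fits(path: str, rate_hz: float) -> bool:
    period_ms = 1000.0 / rate_hz
    return PATHS_MS[path] <= period_ms

print(fits("cloud_llm", 1))           # planning at 1 Hz: True
print(fits("edge_llm", 1))            # planning at 1 Hz: True
print(fits("cloud_llm", 100))         # control at 100 Hz over network: False
print(fits("onboard_control", 1000))  # control at 1 kHz on-device: True
```

A 300 ms path fits comfortably in a 1 Hz planning loop but overruns a 10 ms control period thirtyfold, which is exactly why control stays on-device.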


Setting Up LLM on Jetson Orin

Option 1: Cloud API (GPT-4 Turbo) - Easy but Requires Internet

# Install OpenAI Python SDK
pip3 install openai

# Set API key
export OPENAI_API_KEY="sk-your-key-here"

Advantages:

  • ✅ Highest accuracy (GPT-4 Turbo = 90%+ task success)
  • ✅ No local compute required
  • ✅ Always updated to latest model

Disadvantages:

  • ❌ Requires internet (fails in offline environments)
  • ❌ 200-500ms latency
  • ❌ $0.01 per 1K tokens (~$0.001 per command)

Option 2: Local LLM (Llama 3.1 8B) - Edge Inference

# Install llama.cpp for optimized inference
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make  # newer llama.cpp releases build with cmake instead

# Download Llama 3.1 8B model (quantized to 4-bit for Jetson)
wget https://huggingface.co/TheBloke/Llama-3.1-8B-GGUF/resolve/main/llama-3.1-8b.Q4_K_M.gguf

# Test inference (the binary is named llama-cli in newer releases)
./main -m llama-3.1-8b.Q4_K_M.gguf -p "You are a robot. Plan how to clean a kitchen." -n 256

Advantages:

  • ✅ Fully offline (no internet required)
  • ✅ Zero API costs
  • ✅ Data privacy (all processing on-device)

Disadvantages:

  • ❌ Lower accuracy than GPT-4 (80% vs 90% task success)
  • ❌ Requires 8GB VRAM (Jetson Orin Nano minimum)
  • ❌ 500-1000ms inference latency

Performance on Jetson Orin Nano:

| Model | Parameters | Quantization | VRAM | Inference Time | Accuracy |
| --- | --- | --- | --- | --- | --- |
| Llama 3.1 8B | 8B | Q4_K_M | 5 GB | 500ms | 80% |
| Phi-3 Mini | 3.8B | Q4_K_M | 3 GB | 250ms | 75% |
| GPT-4 Turbo (Cloud) | 1.76T | - | 0 GB | 200ms + network | 90% |

Recommended: Llama 3.1 8B for offline robotics, GPT-4 for research/demos.
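The trade-off in the table can be encoded as a small selection helper. A sketch, using the figures from the table above (the function name and structure are ours):

```python
# Pick a planner model given deployment constraints.
# The numbers come from the comparison table above.

MODELS = [
    {"name": "Llama 3.1 8B", "vram_gb": 5, "latency_ms": 500, "accuracy": 0.80, "offline": True},
    {"name": "Phi-3 Mini",   "vram_gb": 3, "latency_ms": 250, "accuracy": 0.75, "offline": True},
    {"name": "GPT-4 Turbo",  "vram_gb": 0, "latency_ms": 200, "accuracy": 0.90, "offline": False},
]

def pick_model(need_offline: bool, vram_gb: float):
    """Return the most accurate model that fits the constraints, or None."""
    candidates = [m for m in MODELS
                  if m["vram_gb"] <= vram_gb and (m["offline"] or not need_offline)]
    return max(candidates, key=lambda m: m["accuracy"])["name"] if candidates else None

print(pick_model(need_offline=True, vram_gb=8))   # Llama 3.1 8B
print(pick_model(need_offline=True, vram_gb=4))   # Phi-3 Mini
print(pick_model(need_offline=False, vram_gb=8))  # GPT-4 Turbo
```

The helper makes the recommendation explicit: with internet available, accuracy wins (GPT-4); offline, the biggest model that fits in VRAM wins.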


Complete LLM Task Planner Node

File: llm_task_planner.py

#!/usr/bin/env python3
"""
LLM Task Planner Node
Converts natural language commands into executable robot action sequences
Author: Physical AI Course
Hardware: Jetson Orin Nano + GPT-4 API or Local Llama 3.1
"""

import json
import os
import time

import rclpy
from rclpy.node import Node
from std_msgs.msg import String
from robot_interfaces.msg import TaskPlan, RobotAction  # Custom messages

from openai import OpenAI


class LLMTaskPlanner(Node):
    def __init__(self):
        super().__init__('llm_task_planner')

        # Subscribers
        self.voice_sub = self.create_subscription(
            String,
            '/voice/command',
            self.voice_command_callback,
            10
        )

        # Publishers
        self.plan_pub = self.create_publisher(TaskPlan, '/task_plan', 10)
        self.status_pub = self.create_publisher(String, '/planner/status', 10)

        # OpenAI client -- reads OPENAI_API_KEY from the environment.
        # Never hard-code API keys in source files.
        self.client = OpenAI(api_key=os.environ['OPENAI_API_KEY'])

        # Robot skill library (available primitives)
        self.skill_library = {
            "navigate": {
                "description": "Move robot to a location",
                "parameters": ["x: float", "y: float", "theta: float"],
                "example": "navigate(x=5.0, y=3.0, theta=0.0)"
            },
            "perceive": {
                "description": "Detect objects in environment using camera",
                "parameters": ["object_class: str"],
                "example": "perceive(object_class='mug')"
            },
            "grasp": {
                "description": "Pick up an object",
                "parameters": ["object_id: int"],
                "example": "grasp(object_id=42)"
            },
            "place": {
                "description": "Put down held object at location",
                "parameters": ["location: str"],
                "example": "place(location='table')"
            },
            "search": {
                "description": "Look around for object or location",
                "parameters": ["target: str"],
                "example": "search(target='trash_bin')"
            }
        }

        # Environment knowledge (locations in the map)
        self.locations = {
            "kitchen": {"x": 5.0, "y": 3.0, "theta": 0.0},
            "bedroom": {"x": 2.0, "y": 7.0, "theta": 1.57},
            "living_room": {"x": 8.0, "y": 5.0, "theta": 3.14},
            "trash_bin": {"x": 6.0, "y": 2.0, "theta": 0.0}
        }

        self.get_logger().info('🧠 LLM Task Planner ready!')

    def voice_command_callback(self, msg):
        """Receive voice command and generate task plan"""
        command = msg.data
        self.get_logger().info(f'Received command: "{command}"')

        # Update status
        status_msg = String()
        status_msg.data = f'Planning: {command}'
        self.status_pub.publish(status_msg)

        # Generate plan with LLM
        plan = self.generate_plan(command)

        if plan:
            self.get_logger().info(f'Generated plan with {len(plan)} steps')

            # Publish task plan (each step serialized to JSON, assuming
            # TaskPlan.steps is declared as string[] in the custom message)
            plan_msg = TaskPlan()
            plan_msg.command = command
            plan_msg.steps = [json.dumps(step) for step in plan]
            self.plan_pub.publish(plan_msg)

            # Update status
            status_msg.data = 'Plan ready'
            self.status_pub.publish(status_msg)
        else:
            self.get_logger().error('Failed to generate plan')
            status_msg.data = 'Planning failed'
            self.status_pub.publish(status_msg)

    def generate_plan(self, command):
        """Use GPT-4 to convert command into action sequence"""

        # Construct prompt with robot capabilities
        system_prompt = f"""You are a robot task planner. Convert high-level commands into sequences of robot actions.

Available skills:
{json.dumps(self.skill_library, indent=2)}

Known locations:
{json.dumps(self.locations, indent=2)}

Output format (JSON array):
[
  {{"action": "navigate", "parameters": {{"x": 5.0, "y": 3.0, "theta": 0.0}}}},
  {{"action": "perceive", "parameters": {{"object_class": "mug"}}}},
  ...
]

Rules:
1. Always navigate before attempting manipulation
2. Perceive before grasping (need object_id from perception)
3. Use known locations when available
4. If location unknown, add search step first
5. Keep plans simple (3-7 steps max)
"""

        user_prompt = f"Command: {command}\n\nGenerate action sequence:"

        try:
            # Call GPT-4 API
            start_time = time.time()

            response = self.client.chat.completions.create(
                model="gpt-4-turbo",
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": user_prompt}
                ],
                temperature=0.0,  # Deterministic output
                max_tokens=512
            )

            elapsed = time.time() - start_time
            self.get_logger().info(f'GPT-4 response time: {elapsed:.2f}s')

            # Parse JSON response
            plan_text = response.choices[0].message.content

            # Extract JSON (handles markdown code blocks)
            if '```json' in plan_text:
                plan_text = plan_text.split('```json')[1].split('```')[0]
            elif '```' in plan_text:
                plan_text = plan_text.split('```')[1].split('```')[0]

            plan = json.loads(plan_text.strip())

            # Validate plan structure
            if not isinstance(plan, list):
                raise ValueError("Plan must be a list of actions")

            for step in plan:
                if 'action' not in step or 'parameters' not in step:
                    raise ValueError("Each step must have 'action' and 'parameters'")

                if step['action'] not in self.skill_library:
                    raise ValueError(f"Unknown action: {step['action']}")

            return plan

        except Exception as e:
            self.get_logger().error(f'Plan generation error: {e}')
            return None


def main(args=None):
    rclpy.init(args=args)
    node = LLMTaskPlanner()

    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        node.get_logger().info('Shutting down LLM planner...')
    finally:
        node.destroy_node()
        rclpy.shutdown()


if __name__ == '__main__':
    main()

Example: "Clean the Kitchen" Execution

User Command

"Clean the kitchen"

GPT-4 Generated Plan

[
  {
    "action": "navigate",
    "parameters": {"x": 5.0, "y": 3.0, "theta": 0.0},
    "description": "Go to kitchen"
  },
  {
    "action": "perceive",
    "parameters": {"object_class": "trash"},
    "description": "Find trash/clutter"
  },
  {
    "action": "grasp",
    "parameters": {"object_id": "${perception_result}"},
    "description": "Pick up first item"
  },
  {
    "action": "navigate",
    "parameters": {"x": 6.0, "y": 2.0, "theta": 0.0},
    "description": "Go to trash bin"
  },
  {
    "action": "place",
    "parameters": {"location": "trash_bin"},
    "description": "Dispose trash"
  },
  {
    "action": "navigate",
    "parameters": {"x": 5.0, "y": 3.0, "theta": 0.0},
    "description": "Return to kitchen"
  }
]
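On the executor side, each step of such a plan must be dispatched to the matching skill. A minimal sketch of that dispatch, with print-free stubs standing in for the real ROS 2 action calls; resolving the `${perception_result}` placeholder from the preceding perceive step is our illustrative convention.

```python
# Dispatch a JSON plan to skill functions. The skills here are stubs;
# in the real system each would trigger a ROS 2 action.

def navigate(x, y, theta=0.0):
    return {"ok": True}

def perceive(object_class):
    return {"ok": True, "object_id": 42}  # Stub: pretend we found object 42

def grasp(object_id):
    return {"ok": True}

def place(location):
    return {"ok": True}

SKILLS = {"navigate": navigate, "perceive": perceive, "grasp": grasp, "place": place}

def execute_plan(plan):
    context = {}  # Carries results (e.g. object_id) between steps
    for step in plan:
        params = dict(step["parameters"])
        # Resolve "${perception_result}" from the last perceive step
        for key, value in params.items():
            if value == "${perception_result}":
                params[key] = context["object_id"]
        result = SKILLS[step["action"]](**params)
        if not result["ok"]:
            return False
        context.update(result)
    return True

plan = [
    {"action": "navigate", "parameters": {"x": 5.0, "y": 3.0, "theta": 0.0}},
    {"action": "perceive", "parameters": {"object_class": "trash"}},
    {"action": "grasp", "parameters": {"object_id": "${perception_result}"}},
    {"action": "place", "parameters": {"location": "trash_bin"}},
]
done = execute_plan(plan)
```

The shared `context` dict is the simplest way to thread perception results into later grasp steps; a production executor would use the ROS 2 action result messages instead.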

Execution Log

[INFO] Step 1/6: navigate(x=5.0, y=3.0) → SUCCESS (8.2s)
[INFO] Step 2/6: perceive(object_class=trash) → SUCCESS (1.5s, detected 3 objects)
[INFO] Step 3/6: grasp(object_id=42) → SUCCESS (4.1s)
[INFO] Step 4/6: navigate(x=6.0, y=2.0) → SUCCESS (5.3s)
[INFO] Step 5/6: place(location=trash_bin) → SUCCESS (2.7s)
[INFO] Step 6/6: navigate(x=5.0, y=3.0) → SUCCESS (5.1s)
[INFO] Task completed in 26.9 seconds

Safety Layer: Preventing Dangerous Commands

Problem: LLMs can hallucinate harmful actions:

  • "throw the knife at the wall"
  • "move at 10 m/s indoors"
  • "grasp the hot stove"

Solution: Add rule-based safety checks before execution.

import json
import re


class SafetyValidator:
    def __init__(self):
        # Forbidden action patterns, checked as regexes against the
        # serialized step (velocity limits are checked explicitly below)
        self.blacklist = [
            r"grasp.*knife",
            r"grasp.*stove",
            r"throw"
        ]

    def validate_plan(self, plan):
        """Check if plan contains dangerous actions"""
        for step in plan:
            action_str = f"{step['action']} {json.dumps(step['parameters'])}"

            # Check against blacklist
            for pattern in self.blacklist:
                if re.search(pattern, action_str, re.IGNORECASE):
                    return False, f"Unsafe action detected: {action_str}"

            # Check velocity limits (max 2 m/s indoors)
            if step['action'] == 'navigate':
                if step['parameters'].get('speed', 0.0) > 2.0:
                    return False, "Speed exceeds safe limit (2 m/s)"

            # Check workspace bounds
            if step['action'] in ['navigate', 'place']:
                x = step['parameters'].get('x', 0)
                y = step['parameters'].get('y', 0)
                if abs(x) > 10 or abs(y) > 10:
                    return False, f"Position ({x}, {y}) outside workspace"

        return True, "Plan is safe"


# In LLMTaskPlanner, validate inside generate_plan before returning:
def generate_plan(self, command):
    plan = ...  # LLM generation as shown above

    # Validate before publishing
    validator = SafetyValidator()
    is_safe, message = validator.validate_plan(plan)

    if not is_safe:
        self.get_logger().error(f'UNSAFE PLAN BLOCKED: {message}')
        return None

    return plan
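To see the validator reject a hallucinated plan, here is a condensed, self-contained version of the blacklist check above, run against one safe and one unsafe plan:

```python
import json
import re

# Condensed version of the SafetyValidator blacklist check above
BLACKLIST = [r"grasp.*knife", r"grasp.*stove", r"throw"]

def is_safe(plan):
    for step in plan:
        action_str = f"{step['action']} {json.dumps(step['parameters'])}"
        if any(re.search(p, action_str, re.IGNORECASE) for p in BLACKLIST):
            return False
    return True

good = [{"action": "grasp", "parameters": {"object_id": 42}}]
bad = [{"action": "grasp", "parameters": {"object_class": "knife"}}]

print(is_safe(good))  # True
print(is_safe(bad))   # False
```

Because the check runs on the serialized step, it catches the dangerous combination (grasp + knife) regardless of which parameter field carries the object name.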

Hands-On Exercise: Add "Bring Me" Command

Challenge: Implement "Bring me the red mug" command.

Required Steps:

  1. Search for "red mug" in current room
  2. Navigate to object location
  3. Grasp object
  4. Navigate back to user
  5. Place object near user

Starter Code:

# Add to skill_library:
"handover": {
    "description": "Give held object to human",
    "parameters": ["approach_distance: float"],
    "example": "handover(approach_distance=0.5)"
}

# Test prompt:
#   "Bring me the red mug from the kitchen"

# Expected plan:
# [
#     {"action": "navigate", "parameters": {"x": 5.0, "y": 3.0}},
#     {"action": "perceive", "parameters": {"object_class": "mug", "color": "red"}},
#     {"action": "grasp", "parameters": {"object_id": "${perception_result}"}},
#     {"action": "navigate", "parameters": {"x": 0.0, "y": 0.0}},  # User location
#     {"action": "handover", "parameters": {"approach_distance": 0.5}}
# ]

Key Takeaways

✅ LLMs decompose high-level commands into executable primitives
✅ Cloud LLMs (GPT-4) have 90% task success but require internet
✅ Edge LLMs (Llama 3.1 8B) run offline on Jetson with 80% success
✅ Planning runs at 1-10 Hz, control runs at 100-1000 Hz (separate!)
✅ Safety validation prevents dangerous actions from LLM hallucinations
✅ Skill library defines robot capabilities as structured functions


What's Next?

You've built the cognitive planner. The next chapter introduces Vision-Language-Action (VLA) models: end-to-end models like RT-2 and OpenVLA that directly map camera images + text commands → robot actions, bypassing separate planning and control.


Further Reading