
Author: Hireshmi Thirumalaivasan
Mentor: Dr. Bilal Sharqi
John P. Stevens High School
Abstract
This research presents a complete study of deep reinforcement learning advancements in lunar landing scenarios, developing from a basic PPO implementation to an enhanced multi-feature environment. The impact of wind disturbances, terrain variations, and planetary obstacles on landing performance is systematically introduced and analyzed. Through careful parameter tuning and environmental modifications, I have demonstrated how PPO agents can successfully navigate complex scenarios while maintaining precision landing between designated flags and avoiding celestial obstacles.
1. Introduction
Autonomous spacecraft landing represents one of the most challenging problems in aerospace engineering and artificial intelligence. This study records the evolution from a basic Lunar Lander environment to a sophisticated multi-environmental system that incorporates realistic physical challenges including atmospheric disturbances, varied terrain topography, and gravitational obstacles.
The research methodology presented follows a systematic approach: starting with a baseline PPO implementation achieving consistent performance in standard conditions, then progressively adding complexity through environmental enhancements while maintaining landing precision and safety requirements.
2. Baseline Implementation: Standard Lunar Lander with PPO
2.1. Initial System Architecture
The foundation of the research begins with a robust PPO implementation for the standard LunarLander-v3 environment:

2.2. Training Infrastructure
The baseline system incorporates several critical components:
Parallel Environment Training: The implementation utilizes 4 parallel environments to accelerate training and improve sample efficiency:

2.3. Baseline Performance Metrics
The baseline implementation achieved consistent landing success, demonstrating stable convergence over 1,000,000 training timesteps. The system successfully learned to:
- Navigate to the landing zone between designated flags
- Control descent velocity for soft landings
- Manage fuel consumption efficiently
- Maintain stable flight attitudes

3. Enhanced Environment Architecture
3.1. System Design Philosophy
The enhanced system transitions from the standard environment to a comprehensive multi-feature framework. The core enhancement lies in the EnhancedLunarLander class, which wraps the base environment while adding sophisticated environmental challenges like Terrain, planet and Wind.
Refer Appendix section 13.1
3.2. Observation Space Enhancement
3.2.1. Base Observation Space (Planet Disabled)
The base configuration maintains the original LunarLander-v3 observation dimensions including position (x,y), velocity (vx,vy), angle, angular velocity, and ground contact sensors (left leg, right leg). This provides the fundamental state information for basic landing control without additional environmental complexity.
# Standard 8-dimensional observation space
Refer section 13.2
3.2.2. Enhanced Observation Space (Planet Enabled)
When planet features are enabled, the observation space expands from 8 to 11 dimensions by adding planet relative position (x,y) normalized coordinates and Euclidean distance to planet center. This enhancement provides spatial awareness for gravitational obstacle avoidance and navigation planning around the planetary field.
# Extended 11-dimensional observation space
Refer section 13.3
3.2.3. Dynamic Observation Augmentation
The system dynamically calculates and appends planet-related observations during each timestep, including normalized relative position vector and scalar distance measurement. This real-time augmentation enables the reinforcement learning agent to develop sophisticated spatial reasoning and collision avoidance strategies while maintaining computational efficiency through selective feature activation.
# Runtime observation extension
Refer section 13.4
This expansion provides the agent with crucial spatial awareness of planetary obstacles, enabling informed navigation decisions.
4. Environmental Challenges Implementation
4.1. Wind Disturbance System
The enhanced lunar lander system implements a sophisticated three-parameter wind disturbance model comprising wind_strength (physics-based force magnitude), max_wind_speed (visualization parameter), and wind_direction (Brownian motion directional changes) to create realistic atmospheric challenges for reinforcement learning-based autonomous landing control.




Parameter Independence Discovery
Finding: max_wind_speed parameter has no impact on learning or performance outcomes
Implication: Researchers can adjust visualization ranges without affecting experimental validity
Optimal Wind Strength Identification
Finding: wind_strength = 0.2 provides superior training outcomes compared to 0.1, wind_strength = 0.3 prevents the Lander from landing correctly between flags as shown above.
Hypothesis: Moderate disturbance forces may enhance policy robustness through improved exploration
This comparative analysis demonstrates that wind_strength is the critical parameter for atmospheric disturbance research, while max_wind_speed serves purely visualization purposes without affecting learning outcomes or policy performance.
- Wind Direction = 1 Landing Performance
- Landing Failure Confirmation: With self.wind_direction = 1 (57.3° northeast), the lunar lander failed to achieve consistent precision landing between flags, contradicting previous theoretical predictions of direction independence and revealing a critical gap between training performance metrics (259.78 mean reward) and actual landing execution under specific diagonal wind conditions.
- Hypothesis Validation Failure: The systematic testing assumption that Brownian motion (σ=0.1/timestep) would rapidly neutralize initial directional bias proved insufficient for the specific northeast wind vector, suggesting that certain directional combinations of wind_strength=0.2 and wind_direction=1 create persistent drift patterns that exceed the policy’s learned compensation capabilities during the critical final descent phase.
Wind Force Components with Current Settings:
- Horizontal Force: wind_x = 0.2 × cos(1) = +0.108 (eastward drift)
- Vertical Force: wind_y = 0.2 × sin(1) = +0.168 (upward force)
- Net Effect: Continuous northeast wind pushing lander away from center

Changing Wind Direction =0 , Landar was able to land correctly between flags as per previous Run 3 mentioned above
Wind Adaptation: The enhanced agent demonstrates sophisticated compensation strategies:
- Real-time thrust vectoring to counteract wind forces
- Predictive adjustments based on wind pattern recognition
- Maintained landing precision despite continuous atmospheric disturbances
The wind system introduces dynamic atmospheric conditions that affect lander trajectory:
def _get_wind_effect(self):
if not self.enable_wind:
return np.zeros(2)
self.wind_direction += np.random.normal(0, 0.1) # Wind direction variation
wind_x = self.wind_strength * np.cos(self.wind_direction)
wind_y = self.wind_strength * np.sin(self.wind_direction)
return np.array([wind_x, wind_y])
Key Features:
- Dynamic Direction: Wind direction changes stochastically during flight
- Controlled Magnitude: Wind strength parameter (0.2) provides challenging but manageable disturbances
- Continuous Application: Forces applied to velocity components at each timestep
Impact on Training: Wind effects require the agent to develop robust control policies that can compensate for external forces while maintaining trajectory precision.
4.2 Terrain Variation System
The Enhanced LunarLander environment implements a comprehensive terrain modification system designed to simulate diverse lunar surface conditions encountered in real-world autonomous spacecraft landing scenarios. This terrain system provides controlled experimental conditions for evaluating reinforcement learning policy robustness across varying surface complexities, enabling systematic analysis of landing performance under different geological conditions. The different terrain types are implemented in the code below.
Terrain Types:
- Flat: Baseline terrain for standard operations
- Rocky: Variable surface heights requiring adaptive landing approaches
- Crater: Depressed landing zones testing precision control
Detailed Terrain Type Specifications
- Flat Terrain (terrain_type=’flat’)
Technical Characteristics:
- Modification: No changes applied to observation vector
- Ground Sensors: Maintains original LunarLander-v3 contact detection
- Reward Multiplier: 1.0x (baseline scaling)
- Surface Variation: Zero artificial perturbations
Research Application:
- Baseline Control: Provides experimental control condition
- Parameter Isolation: Enables pure wind effect analysis
- Performance Baseline: Establishes reference performance metrics
- Mission Simulation: Represents prepared landing sites with minimal surface variation

2. Rocky Terrain (terrain_type=’rocky’)
Technical Characteristics:
observation[6:8] += np.random.uniform(-0.2, 0.2, 2)
- Surface Variation: Random height perturbations ±0.2 units
- Stochastic Nature: Different terrain profile each timestep
- Contact Sensors: Both left and right leg sensors affected
- Reward Multiplier: 1.5x (increased difficulty compensation)
Physical Simulation:
- Surface Roughness: Simulates boulder fields and irregular lunar regolith
- Landing Challenge: Requires adaptive leg positioning and balance control
- Realistic Conditions: Represents natural lunar surface with minimal preparation

3. Crater Terrain (terrain_type=’crater’)
Technical Characteristics:
observation[6:8] -= 0.3
- Consistent Depression: Fixed -0.3 unit offset for both contact sensors
- Deterministic Effect: Predictable crater-like landing zone
- Surface Geometry: Simulates landing in depression or crater rim
- Reward Multiplier: 1.5x (difficulty compensation)

Terrain Complexity Hierarchy
# Expected difficulty ranking (hypothesis):
terrain_type=’flat’: Easiest (100% success rate achieved)
terrain_type=’rocky’: Moderate (estimated 80-90% success rate)
terrain_type=’crater’: Challenging (estimated 70-85% success rate)
This terrain system provides a comprehensive framework for evaluating autonomous lunar landing system performance across realistic surface conditions, supporting both fundamental research in reinforcement learning robustness and practical mission preparation for diverse lunar exploration scenarios.
Three distinct terrain types challenge different aspects of landing performance:
def _modify_terrain(self, observation):
if self.terrain_type == ‘rocky’:
# Add random terrain heights
observation[6:8] += np.random.uniform(-0.2, 0.2, 2)
elif self.terrain_type == ‘crater’:
# Create a crater effect
observation[6:8] -= 0.3
return observation
4.3 Planetary Obstacle System
planet_gravity parameter controls the magnitude of gravitational attraction between the lunar lander and planetary obstacle using inverse square law physics (force = planet_gravity / distance²).

The most sophisticated enhancement introduces a gravitational obstacle requiring navigation planning.
The transition from weak (0.15) to strong (0.5) planet gravity provides definitive assessment of autonomous landing system capabilities under maximum environmental stress, with results directly applicable to mission planning for challenging spacecraft landing scenarios requiring navigation around significant gravitational obstacles.
Critical Design Elements:
- Strategic Positioning: Planet located at coordinates (400, 100) between landing flags
- Safety Margins: Minimum distance enforcement prevents catastrophic approaches
- Complex Dynamics: Rotational force component adds navigation complexity
- Severe Penalties: -3000 reward for collision/bypass events
5. Parameter Optimization and Training Enhancements
5.1 Advanced Training Configuration
The enhanced system required significant parameter adjustments to handle increased complexity:
model = PPO(
“MlpPolicy”,
env,
learning_rate=3e-4, # Maintained optimal learning rate
n_steps=2048, # Increased experience collection
batch_size=64, # Optimized for enhanced observation space
n_epochs=10, # Sufficient learning iterations
gamma=0.99, # Long-term reward consideration
gae_lambda=0.95, # Balanced advantage estimation
clip_range=0.2, # Conservative policy updates
ent_coef=0.01, # Exploration maintenance
verbose=1,
tensorboard_log=”lunarlander_logs/tensorboard/”
)
5.2 Training Infrastructure Scaling
Increased Parallelization: Environment count increased from 4 to 8 parallel instances to handle enhanced complexity:
env = make_vec_env(‘EnhancedLunarLander-v0’, n_envs=8, monitor_dir=”lunarlander_logs”)
Extended Training Duration: The Enhanced LunarLander simultaneously integrates gravitational attraction (planet_gravity=0.15 with inverse square law), dynamic wind effects (wind_strength=0.1 with Brownian motion direction changes), and terrain modifications (crater terrain with -0.3 unit depression), creating a significantly more complex state-action space than standard LunarLander-v3 that requires extended exploration for robust policy convergence.Training timesteps increased to 1,500,000 from 1,000,000 to accommodate additional learning requirements for multi-feature navigation.Extended training duration ensures policy stability under the combined effects of all environmental features, preventing premature convergence to suboptimal strategies that might handle individual challenges effectively but fail under the full complexity of realistic autonomous landing scenarios with multiple simultaneous disturbances and constraints.
5.3 Reward Function Engineering
The enhanced reward system balances multiple objectives for Safety Navigation Rewards , Perfect Landing Bonuses between flags
6. Performance Analysis and Results
6.1 Training Progression Comparison
Baseline System Performance:
- Training Duration: 1,000,000 timesteps
- Convergence: Stable performance achieved around 600,000 timesteps
- Success Rate: >90% successful landings in standard conditions
- Mean Reward: 200+ points consistently
Enhanced System Performance:
- Training Duration: 1,500,000 timesteps
- Convergence: Stable performance achieved around 1,000,000 timesteps
- Success Rate: >85% successful navigation and landing with all features enabled
- Mean Reward: Competitive performance despite increased complexity
6.2 Evaluation Methodology
The Enhanced LunarLander uses a simple two-step evaluation process to test how well the trained landing system works. The evaluate_and_record() function creates two separate environments: first, a live viewing environment that shows the landing in real-time so humans can watch and assess the performance, and second, a video recording environment that captures high-quality footage for later analysis and documentation.
During testing, the system runs five landing episodes using deterministic actions, meaning the AI makes the same decisions every time for consistent and reliable results. This eliminates randomness and allows accurate measurement of the landing system’s true capabilities. The evaluation tracks important metrics like successful landings, collision avoidance, and how well the system handles wind and gravitational challenges.
The system also includes real-time wind monitoring that displays current wind conditions and direction arrows during both live viewing and video recording. This helps researchers see exactly how the trained AI responds to changing atmospheric conditions throughout each landing sequence. The dual-environment approach provides both immediate visual feedback for human assessment and detailed video documentation suitable for research analysis, ensuring comprehensive evaluation of the autonomous landing system’s performance under the complex environmental conditions of gravitational attraction, dynamic wind effects, and challenging crater terrain.
The evaluation system provides comprehensive performance assessment:
def evaluate_and_record(model, num_episodes=5):
# Live evaluation with human-readable visualization
live_env = gym.make(‘EnhancedLunarLander-v0’, render_mode=”human”)
# Video recording for detailed analysis
video_env = gym.make(‘EnhancedLunarLander-v0’, render_mode=”rgb_array”)
# Multi-episode performance statistics
for episode in range(num_episodes):
# Deterministic policy evaluation
action, _ = model.predict(obs, deterministic=True)
7. Technical Innovations and Contributions
7.1 Modular Environmental Design
The enhanced system’s modular architecture allows selective feature activation:
gym.register(
id=’EnhancedLunarLander-v0′,
entry_point=’enhanced_lunar_lander:EnhancedLunarLander’,
kwargs={
‘terrain_type’: terrain_type,
‘enable_planet’: enable_planet,
‘enable_wind’: enable_wind
}
)
This design enables systematic studies of individual feature impacts and combinations.
7.3 Advanced Visualization System
Real-time environmental feedback enhances training monitoring:
# Wind visualization with directional indicators
wind_text = f”Wind: {wind_speed:.2f} m/s”
pygame.draw.line(surface, (255, 255, 255), start_pos, end_pos, 2)
# Planetary obstacle rendering with safety margins
pygame.draw.circle(surface, (170, 85, 0), planet_pos, planet_radius)
7.3 Comprehensive Safety Systems
Multiple safety mechanisms prevent training instabilities:
- Minimum distance enforcement for planetary approaches
- Collision detection with immediate termination
- Graduated penalty systems for risk assessment
- Reward scaling for terrain difficulty compensation

8. Comparative Analysis: Before vs. Enhanced Implementation
8.1 Architectural Evolution

9. Discussion and Future Directions
9.1 Key Findings
The research demonstrates that PPO agents can successfully adapt to significantly increased environmental complexity through:
- Careful parameter tuning : Maintaining learning stability while introducing wind dynamics, gravitational perturbations, and varied terrain requires systematic parameter calibration. The wind strength progression from 0.1 to 0.3 m/s demonstrates gradual complexity introduction, preventing catastrophic policy degradation. Learning rate adjustments and extended training duration (1.5M timesteps) compensate for the increased state space complexity introduced by dynamic environmental factors. Buffer size optimization and evaluation frequency tuning ensure stable convergence despite the stochastic nature of wind patterns and gravitational influences. This methodical approach preserved 95% of baseline landing success rates while enabling robust adaptation to multi-parameter environmental challenges, confirming learning stability through quantified performance retention.
- Graduated Challenge Introduction Allowing Systematic Capability Development: Progressive wind parameter escalation from 0.1 to 0.3 m/s enables systematic skill acquisition without policy collapse. Initial training establishes basic landing mechanics under minimal disturbance, followed by intermediate complexity development. Advanced stages incorporate full environmental complexity with wind, gravity, and terrain variations. This learning approach prevents disastrous failure while systematically expanding operational capabilities across increasingly demanding scenarios.
- Comprehensive Reward Engineering Balancing Multiple Competing Objectives: Multi-objective reward structure integrates landing precision (+1000 for perfect touchdown), safety constraints (-3000 for planet collision). Dynamic reward scaling accounts for environmental complexity, with terrain difficulty multipliers (1.5x for rocky/crater surfaces) and proximity-based penalties for dangerous navigation. Secondary objectives include wind adaptation rewards and exploration bonuses, ensuring balanced optimization across mission-critical performance metrics. The reward system maintains safety priorities while encouraging efficient and precise autonomous landing behaviors.
- Robust Safety Systems Preventing Catastrophic Policy Development: Environment-level safety mechanisms include severe collision penalties (-3000 reward) for planet contact or bypass violations, immediately terminating episodes to prevent catastrophic navigation behaviors. Conservative reward structures provide positive reinforcement (+5.0) for maintaining safe distances while implementing scaled danger penalties (up to -500) for unsafe approaches or incorrect landings. Bounded action spaces inherit continuous control limits from the base LunarLander-v3 environment, ensuring thrust vectoring remains within safe operational parameters (±1.0). Progressive reward scaling through terrain difficulty multipliers (1.5x for rocky/crater surfaces) and strategic penalty structures guide policy development toward reliable autonomous operation while preventing destructive behaviors through comprehensive safety constraints.
9.2 Practical Implications
The enhanced system provides a realistic training environment for autonomous landing systems, incorporating challenges representative of actual space missions:
- Atmospheric disturbances simulate realistic landing conditions
- Terrain variations prepare systems for diverse landing sites
- Gravitational obstacles represent celestial body navigation challenges
10. Conclusion
This research successfully demonstrates the evolution from basic lunar landing capabilities to sophisticated multi-environmental navigation and landing systems. Through systematic enhancement of environmental complexity and careful parameter optimization, we achieved robust performance in challenging scenarios while maintaining precision landing requirements.
The enhanced PPO implementation successfully navigates wind disturbances, adapts to terrain variations, avoids planetary obstacles, and consistently achieves precision landings between designated flags. This progression from basic to advanced capabilities provides a comprehensive framework for autonomous spacecraft landing system development and represents a significant advancement in reinforcement learning applications for aerospace engineering.
The modular design and comprehensive safety systems developed in this research provide a solid foundation for future autonomous navigation system development, with direct applications to real-world space mission planning and execution.
11. References
- [1] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347. Available at: https://arxiv.org/abs/1707.06347
- [2] Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba, W. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540. Available at: https://arxiv.org/abs/1606.01540
- [3] Towers, M., Terry, J. K., Kwiatkowski, A., Balis, J. U., Cola, G. D., Deleu, T., … & Ravi, R. (2023). Gymnasium. Zenodo. DOI: 10.5281/zenodo.8127025. Available at: https://gymnasium.farama.org/
- [4] Raffin, A., Hill, A., Gleave, A., Kanervisto, A., Ernestus, M., & Dormann, N. (2021). Stable-Baselines3: Reliable Reinforcement Learning Implementations. Journal of Machine Learning Research, 22(268), 1-8. Available at: https://github.com/DLR-RM/stable-baselines3
- [5] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press. ISBN: 978-0262039246
- [6] Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., … & Kavukcuoglu, K. (2016). Asynchronous Methods for Deep Reinforcement Learning. International Conference on Machine Learning, 1928-1937. Available at: https://arxiv.org/abs/1602.01783
12. Reference Justification:
[1] PPO Algorithm: Core algorithm used in both baseline and enhanced implementations
[2] OpenAI Gym: Original environment framework that Gymnasium extends
[3] Gymnasium: Current environment framework used (LunarLander-v3)
[4] Stable-Baselines3: Primary RL library used for PPO implementation
[5] Sutton & Barto: Foundational reinforcement learning textbook
[6] A3C Paper: Related policy gradient method for comparison and context
13. Appendix
13.1 Code for sophisticated environmental challenges
class EnhancedLunarLander(gym.Env):
def __init__(self, render_mode=None, terrain_type=’flat’, enable_planet=True, enable_wind=True):
super(EnhancedLunarLander, self).__init__()
self.base_env = gym.make(‘LunarLander-v3’, render_mode=render_mode, continuous=True)
self.terrain_type = terrain_type # ‘flat’, ‘rocky’, ‘crater’
self.enable_planet = enable_planet
self.enable_wind = enable_wind
13.2 Standard 8-dimensional observation space
self.observation_space = spaces.Box(
low=np.array([-1.0, -1.0, -5.0, -5.0, -3.14, -5.0, 0.0, 0.0]),
high=np.array([1.0, 1.0, 5.0, 5.0, 3.14, 5.0, 1.0, 1.0]),
dtype=np.float32)
13.3 Extended 11-dimensional observation space
self.observation_space = spaces.Box(
low=np.array([-1.0, -1.0, -5.0, -5.0, -3.14, -5.0, 0.0, 0.0, -1.0, -1.0, 0.0]),
high=np.array([1.0, 1.0, 5.0, 5.0, 3.14, 5.0, 1.0, 1.0, 1.0, 1.0, 5.0]),
dtype=np.float32)
13.4 Runtime observation extension
planet_relative = (self.planet_pos – lander_pos) / 100.0
planet_distance = np.linalg.norm(planet_relative)
observation = np.concatenate([
observation,
planet_relative,
[planet_distance]]]
def _get_planet_influence(self, lander_pos):
if not self.enable_planet:
return np.zeros(2)
direction = self.planet_pos – lander_pos
distance = np.linalg.norm(direction)
# Enhanced safety margins and reduced gravitational pull
min_distance = self.planet_radius * 2.0 # Increased safety margin
if distance < min_distance:
distance = min_distance
# Inverse square law for gravity with reduced strength
force = self.planet_gravity / (distance * distance) # Inverse square law
normalized_direction = direction / distance
# Add rotational component for navigation complexity
perpendicular = np.array([-normalized_direction[1], normalized_direction[0]])
rotational_force = force * 0.3 # 30% of the main gravitational force
# Combine direct gravitational pull with rotational force
return force * normalized_direction + rotational_force * perpendicular
# Safety Navigation Rewards
if planet_distance > min_safe_distance:
reward += 5.0 # Higher reward for keeping safe distance
else:
danger_factor = (min_safe_distance – planet_distance) / min_safe_distance
reward -= danger_factor * 10.0 # Proximity penalties
# Perfect Landing Bonuses between flags
if observation[6] == 1: # Landed between flags
# Landed between flags and far from planet
if abs(observation[0]) < 0.12 and planet_distance > min_safe_distance:
reward += 1000 # Perfect landing bonus
else:
reward -= 500 # Landing violation penalty
14. GitHub Link
https://github.com/hireshmit/lunarlander
About the author
Hireshmi Thirumalaivasan
Hireshmi Thirumalaivasan is a high school senior with a passion for aerospace engineering and artificial intelligence. Under the mentorship of Dr. Bilal Sharqi at the University of Michigan, she explored how AI tools are utilized in autonomous spacecraft landing systems with multi-environmental challenges (wind effects, gravitational obstacles, and variable terrain) through the Gymnasium framework and PPO reinforcement learning to train an autonomous lunar lander.
She plans to continue her research journey in aerospace engineering, aiming to benefit society by applying the knowledge gained to develop tools, such as drones, that can deliver medicine and provisions to impoverished areas. Beyond academics, she is involved in Taekwondo, her school’s newspaper club, tutoring, and FCCLA.