Summary
The top level of the architecture is decomposed into capabilities that control different parts of the output, which results in the following benefits:
- Speeds up the reinforcement learning and reduces memory usage.
- Provides more flexibility in the behaviors instead of relying on default tactics.
- Avoids the problem of executing multiple actions in reinforcement learning by separating the actions and splitting the reward signal (see the sketch after this list).
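To make the reward-splitting point concrete, here is a minimal sketch (with hypothetical class and capability names) in which each capability keeps its own learner and receives only the slice of the reward signal it is responsible for, so no single learner has to account for several simultaneous actions:

```python
# Minimal sketch with hypothetical names; each capability runs its own learner
# and receives only its share of the reward signal.
class Capability:
    def __init__(self, name):
        self.name = name
        self.value = 0.0      # running estimate of how well this capability performs
        self.updates = 0

    def reward(self, r):
        # Incremental mean; a real component would run its own learning algorithm.
        self.updates += 1
        self.value += (r - self.value) / self.updates


class Animat:
    def __init__(self):
        # One learner per part of the output, instead of one monolithic learner.
        self.capabilities = {
            "gathering": Capability("gathering"),
            "movement": Capability("movement"),
            "shooting": Capability("shooting"),
        }

    def dispatch_rewards(self, rewards):
        # rewards is a dict such as {"gathering": +1.0, "movement": -0.2}
        for name, r in rewards.items():
            self.capabilities[name].reward(r)
```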
Each component essentially learns a specific capability based on mood and other information:
- Gathering behaviors learn which type of object is desirable based on the inventory status. Statistical analysis is used to estimate the benefits of each object type by collecting immediate feedback (that is, no delayed reward); see the first sketch after this list.
- The movement component uses expert features specific to the situation to determine whether to pursue, evade, or perform other neutral behaviors. Q-learning allows the algorithm to take delayed reward into account; see the second sketch.
- The shooting styles (for instance, aggressive, accurate) are learned with an episodic algorithm. Data is gathered during each fight, and the estimates are updated at the end. Expert features about the fight are also used; see the third sketch.
- Other capabilities are not learned, because they are handled better with standard programming.
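The gathering component can be illustrated as a running average of the immediate benefit observed for each object type. This is only a minimal sketch: the class name, the benefit measure, and the example object types are assumptions, and a real implementation would also condition the estimate on the inventory status.

```python
# Minimal sketch: per-object-type average of immediate feedback (no delayed reward).
from collections import defaultdict


class ObjectValueEstimator:
    def __init__(self):
        self.count = defaultdict(int)
        self.mean_benefit = defaultdict(float)

    def observe(self, object_type, immediate_benefit):
        # Incremental average of the immediate feedback for this object type.
        self.count[object_type] += 1
        n = self.count[object_type]
        self.mean_benefit[object_type] += (immediate_benefit - self.mean_benefit[object_type]) / n

    def best_object(self, candidates):
        # Prefer the object type with the highest estimated benefit so far.
        return max(candidates, key=lambda t: self.mean_benefit[t])


estimator = ObjectValueEstimator()
estimator.observe("health_pack", 25.0)   # illustrative benefit values
estimator.observe("armor", 10.0)
print(estimator.best_object(["health_pack", "armor"]))  # -> health_pack
```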
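For the movement component, the following is a minimal tabular Q-learning sketch. The particular expert features used as the state (for instance, relative health and a distance band) and the class name are assumptions made for illustration.

```python
# Minimal tabular Q-learning sketch for choosing a movement behavior.
import random
from collections import defaultdict

ACTIONS = ["pursue", "evade", "neutral"]


class MovementLearner:
    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1):
        self.q = defaultdict(float)              # (state, action) -> estimated value
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def choose(self, state):
        # Epsilon-greedy policy keeps exploring while learning online.
        if random.random() < self.epsilon:
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard Q-learning backup; bootstrapping handles the delayed reward.
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

The state here would be a small tuple of expert features rather than raw sensor data, which keeps the table compact.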
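The shooting-style component can be sketched as an episodic estimator: feedback is accumulated during the fight, and the estimate for the chosen style is updated only when the fight ends. The names and the choice of fight features are again illustrative assumptions.

```python
# Minimal episodic sketch: data gathered during each fight, estimates updated at the end.
from collections import defaultdict

STYLES = ["aggressive", "accurate"]


class ShootingStyleLearner:
    def __init__(self):
        self.count = defaultdict(int)
        self.value = defaultdict(float)      # (fight_features, style) -> average outcome
        self.episode = []                    # rewards gathered during the current fight

    def record(self, reward):
        self.episode.append(reward)

    def end_of_fight(self, fight_features, style):
        # One update per episode, using the total outcome of the fight.
        # fight_features must be hashable, e.g. a tuple of expert features.
        outcome = sum(self.episode)
        key = (fight_features, style)
        self.count[key] += 1
        self.value[key] += (outcome - self.value[key]) / self.count[key]
        self.episode = []

    def choose(self, fight_features):
        return max(STYLES, key=lambda s: self.value[(fight_features, s)])
```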
Because the learning happens at a high level, the decisions have more effect on the intelligence of the animats than on their realism (which remains good throughout the adaptation). Of course, performance is poor before the learning starts, but the results are noticeable after a few iterations. A few precautions are needed before using the system to learn online (for instance, selecting an appropriate policy), but most of them are already built into the design (for instance, expert features, a compact representation). The next chapter looks into the problem of dealing with adaptive behaviors in more depth.