1 Introduction
In reinforcement learning (RL), an autonomous agent learns to solve a task by interacting with its environment sutton1998reinforcement thus receiving a reward that has to be handengineered by an expert to efficiently guide the agent. However, when the reward is sparse or not available, the agent keeps having an erratic behavior.
This issue partially motivated methods where an agent hierarchically commits to a temporally extended behavior named skills dblpAubretMH20 or options sutton1999between, thus avoiding erratic behaviors and easing the exploration of the environment dblpMachadoBB17; aubret2019survey. While it is possible to learn hierarchical skills with an extrinsic (i.e taskspecific) reward bacon2017option; dblpRiemerLT18, it does not fully address the exploration issue. In contrast, if an agent learns skills in a bottomup way with intrinsic rewards, it makes the skills taskindependent aubret2019survey: they are learnt without access to an extrinsic reward. It follows that, to explore without extrinsic rewards, the intrinsic objective of an agent may be to acquire a large and diverse repertoire of skills. Such paradigm contrasts with the way predictionbased dblpBurdaEPSDE19 or countbased dblpBellemareSOSSM16 methods explore aubret2019survey since their inability to return to the frontier of their knowledge can make their exploration collapse ecoffet2021first. Previous approaches dblpNairPDBLL18; dblpPongDLNBL20; dblpEysenbachGIL19 manage to learn a large set of different skills, yet they are learnt during a developmental period metzen2013incremental, which is just an unsupervised pretraining phase.
These approaches based on a pretraining phase are incompatible with a continual learning framework where the agent has to discover increasingly complex skills according to sequentially given tasks SilverYL13; parisi2019continual. Ideally, a continually learning agent should learn new diverse skills by default, and focus on extrinsically rewarding skills when it gets extrinsic rewards pitis2020maximum; dblpAubretMH20. It would make exploration very efficient since diverse skill learning can maximize the state entropy hazan2019provably; dblpPongDLNBL20. To follow such principle, we assume that the agent should not focus on what to learn, as commonly done in RL, but rather on where to learn in its environment. Thus the agent should learn a representation that keeps the structure of the environment to be able to select where it is worth learning, instead of learning a representation that fits a wanted behavior which is most of the time unknown.
In this paper, we introduce a new way to learn a goal space and select the goals, keeping the learning process endtoend. We propose a new model that progressively Discovers a Topological representation (DisTop) of the environment. DisTop bridges the gap between acquiring skills that reach a uniform distribution of embedded states, and solving a task. It makes diversitybased skill learning suitable for endtoend exploration in a singletask setting irrespective of the ground state space. DisTop simultaneously optimizes three components: 1 it learns a continuous representation of its states using a contrastive loss that brings together consecutive states; 2 this representation allows the building of a discrete topology of the environment with a new variation of a Growing When Required network MarslandSN02. Using the clusters of the topology, DisTop can sample from an almostarbitrary distribution of visited embedded states, without approximating the highdimensional ground state distribution; 3 it trains a goalconditioned deep RL policy to reach embedded states. Upon these 3 components, a hierarchical stateindependent Boltzmann policy selects the cluster of skills to improve according to an extrinsic reward and a diversitybased intrinsic reward.
We show, through DisTop, that the paradigm of choosing where to learn in a learnt topological representation is more generic than previous approaches. Our contribution is 4folds: 1 we visualize the representation learnt by DisTop and exhibit that, unlike previous approaches dblpPongDLNBL20, DisTop is agnostic to the shape of the ground state space and works with states being images, highdimensional proprioceptive data or highdimensional binary data; 2 we demonstrate that DisTop clearly outperforms ELSIM, a recent method that shares similar properties dblpAubretMH20; 3 we show that DisTop achieves similar performance with stateoftheart (SOTA) algorithms on both singletask dense rewards settings haarnoja2018soft and multiskills learning benchmarks dblpPongDLNBL20; 4 we show that it improves exploration over stateoftheart hierarchical methods on hard hierarchical benchmarks.
2 Background
2.1 Goalconditioned reinforcement learning for skill discovery
In episodic RL, the problem is modelled with a Markov Decision Process (MDP). An agent can interact with an unknown environment in the following way: the agent gets a state
that follows an initial state distribution . Its policy selects actions to execute depending on its current state , . Then, a new state is returned according to transition dynamics distribution . The agent repeats the procedure until it reaches a particular state or exceeds a fixed number of steps . This loop is called an episode. An agent tries to adapt to maximize the expected cumulative discounted reward where is the discount factor. The policycan be approximated with a neural network parameterized by
haarnoja2018soft. In our case, the agent tries to learn skills, defined as a policy that tries to target a state (which we name goal). To this end, it autonomously computes an intrinsic reward according to the chosen skill aubret2019survey. The intrinsic reward is approximated through parameters . Following Universal Value Function Approximators schaul2015universal, the intrinsic reward and the skill’s policy become dependent on the chosen goal : and . Hindsight Experience Replay dblpAndrychowiczCRS17 is used to recompute the intrinsic reward to learn on arbitrary skills.In particular, SkewFit
dblpPongDLNBL20 strives to learn goalconditioned policies that visit a maximal entropy of states. To generate goalstates with high entropy, they learn a generative model that incrementally increases the entropy of generated goals and makes visited states progressively tend towards a uniform distribution. Assuming a parameterized generative model is available, an agent would like to learn to maximize the loglikelihood of the state distribution according to a uniform distribution: . But sampling from the uniform distribution is hard since the set of reachable states is unknown. SkewFit uses importance sampling and approximates the true ratio with a skewed distribution of the generative model where . This way, it maximizes where are uniformly sampled from the buffer of the agent. defines the importance given to lowdensity states; if , the distribution will stay similar, if , it strongly overweights lowdensity states to approximate . In practice, they show that directly sampling states , withbeing the normalization constant reduces the variance.
2.2 Contrastive learning and InfoNCE
Contrastive learning aims at discovering a similarity metric between data points chopra2005learning. First, positive pairs that represent similar data and fake negative pairs are built. Second they are given to an algorithm that computes a data representation able to discriminate the positive from the negative ones. InfoNCE dblpjournals/corr/abs180703748 proposed to build positive pairs from states that belong to the same trajectory. The negative pairs are built with states randomly sampled. For a sampled trajectory, they build a set that concatenates negative states and a positive one . Then, they maximize where is a learnt aggregation of previous states, indicates the additional number of steps and is a positive similarity function. The numerator brings together states that are part of the same trajectory and the denominator pushes away negative pairs. InfoNCE maximizes a lower bound of the mutual information where
defines the quantity of information that a random variable contains on another.
3 Method
We propose to redefine the interest of an agent so that its default behavior is to learn skills that achieve a uniform distribution of states while focusing its skill discovery in extrinsically rewarding areas when a task is provided. If the agent understands the true intrinsic environment structure and can execute the skills that navigate in the environment, a hierarchical policy would just have to select the state to target and to call for the corresponding skill. However, it is difficult to learn both skills and the structure. First the ground representation of states may give no clue of this structure and may be highdimensional. Second, the agent has to autonomously discover new reachable states. Third, the agent must be able to easily sample goals where it wants to learn.
DisTop tackles theses challenges by executing skills that target lowdensity or previously rewarding visited states in a learnt topological representation. Figure 1 illustrates how DisTop learns a topological representation. It samples an interaction from different buffers (see below), where is the original goalstate, and embeds the associated observations with a neural network . Then, is trained with the contrastive loss that guarantees the proximity between consecutive states (cf. §3.1
). After that, our growing network dynamically clusters the embedded states in order to uniformly cover the embedded state distribution. DisTop approximates the probability density of visiting a state with the probability of visiting its cluster (cf. §
3.2). Then, it assigns buffers to clusters, and can sample goalstates or states with almostarbitrary density over the support of visited states, e.g the uniform or rewardweighted ones (cf. §3.3).At the beginning of each episode, a stateindependent Boltzmann policy selects a cluster; then a state is selected that belongs to the buffer of the selected cluster and finally compute its representation (see §3.3). A goalconditioned policy , trained to reach , acts in the environment and discovers new reachable states close to its goal. The interactions made by the policy are stored in a buffer according to their initial objective and the embedding of the reached state.
3.1 Learning a representation with a metric
DisTop strives to learn a state representation that reflects the topology of its environment. We propose to maximize the constrained mutual information between the consecutive states resulting from the interactions of an agent. To do this, DisTop takes advantage of the InfoNCE loss (cf. §2.2). In contrast with previous approaches dblpjournals/corr/abs191213414; dblpjournals/corr/abs180703748; li2021learning, we do not consider the whole sequence of interactions since this makes it more dependent on the policy. Typically, a constant and deterministic policy would lose the structure of the environment and could emerge once the agent has converged to an optimal deterministic policy. DisTop considers its local variant and builds positive pairs with only two consecutive states. This way, it keeps distinct the states that cannot be consecutive and it more accurately matches the topology of the environment. We propose to select our similarity function as where is a neural network parameterized by , is the L2 norm and is a temperature hyperparameter chen2020simple. If is a batch of different consecutive pairs, the local InfoNCE objective, LInfoNCE, is described by eq. 1 (cf. Appendix C).
(1) 
We introduce the lower bound since it stabilizes the learning process. Intuitively, eq. 1 brings together states that are consecutive, and pushes away states that are separated by a large number of actions. This is illustrated in the second and third steps of Figure 1. There are several drawbacks with this objective function: the representation may be strongly distorted, making a clustering algorithm inefficient, and may be too unstable to be used by a DRL algorithm. To tackle this, eq. 2 reformulates the objective as a constrained maximization. Firstly, DisTop guarantees that the distance between consecutive states is below a threshold (first constraint of eq. 2). Secondly, it forces our representation to stay consistent over time (second constraint of eq. 2). Weights are a slow exponential moving average of .
(2)  
Transforming the constraints into penalties, we get our final objective in eq. 3. is the temperature hyperparameter that brings closer consecutive states, is the coefficient that slows down the speed of change of the representation.
(3) 
By applying this objective function, DisTop learns a consistent, not distorted representation that keeps close consecutive states while avoiding collapse (see Figure 2 for an example). In fact, by increasing and/or , one can select the level of distortion of the representation (cf. Appendix A for an analysis). One can notice that the function depends on the distribution of consecutive states in ; we experimentally found that using tuples from sufficiently stochastic policy is enough to keep the representation stable. We discuss the distribution of states that feed in §3.4. In the following, we will study how DisTop takes advantage of this specific representation to sample diverse or rewarding stategoals.
3.2 Mapping the topology to efficiently sample rare states
In this section, we define the skewed distribution of goalstates, so that DisTop increases the entropy of visited states by oversampling lowdensity goalstates. We propose to skew the sampling of visited states by partitioning the embedded space using a growing network; then we just need to weight the probability of sampling the partitions. It will promote sampling lowdensity goalstates (cf. §3.3) and will rule sampling for learning the goalconditioned and the representation (cf. §3.4). We first give the working principle before explaining how we define the skewed distribution and sample from it.
Discovering the topology.
Since our representation strives to avoid distortion by keeping close consecutive states, DisTop can match the topology of the embedded environment by clustering the embedded states. DisTop uses an adaptation of a GWQ (cf. §B) called Offpolicy Embedded Growing Network (OEGN). OEGN dynamically creates, moves and deletes clusters so that clusters generates a network that uniformly cover the whole set of embedded states, independently of their density; this is illustrated in the two last steps of Figure 1. Each node (or cluster)
has a reference vector
which represents its position in the input space. The update operators make sure that all embedded states are within a ball centered on the center of its cluster, i.e where is a threshold for creating clusters. is particularly important since it is responsible of the granularity of the clustering: if it is low, we will have a lot of small clusters, else we will obtain few large clusters that badly approximate the topology. The algorithm works as follows: assuming a new lowdensity embedded state is discovered, there are two possibilities: 1 the balls currently overlap: OEGN progressively moves the clusters closer to the new state and reduces overlapping; 2 the clusters almost do not overlap and OEGN creates a new cluster next to the embedded state. A learnt discrete topology can be visualized at the far right of Figure 2. Details about OEGN can be found in §B.Sampling from a skewed distribution
To sample more often lowdensity embedded states, we assume that the density of a visited state is reflected by the marginal probability of its cluster. So we approximate the density of a state with the density parameterized by , reference vector of : where is the number of states that belong to the cluster that contains . Using this approximation, we skew the distribution very efficiently by first sampling a cluster with the probability given by where is the skewed parameter (cf. §2). Then we randomly sample a state that belongs to the cluster.
While our approximation decreases the quality of our skewed objective, it makes our algorithm very efficient: we associate a buffer to each cluster and only weight the distribution of clusters; DisTop does not weight each visited state dblpPongDLNBL20, but only a limited set of clusters. In practice we can tradeoff the bias and the instability of OEGN by decreasing : the smaller the clusters are, the smaller the bias is, but the more unstable is OEGN and the sampling distribution. So, in contrast with SkewFit, we do not need to compute the approximated density of each state, we just need to keep states in the buffer of their closest node. In practice, we associate to each node a second buffer that takes in interactions if the corresponding node is the closest to the interaction’s goal. This results in two sets of buffers: and to sample stategoals and states, respectively, with the skewed distribution.
3.3 Selecting novel or rewarding skills
It is not easy to sample new reachable goals. For instance, using an embedding space , the topology of the environment may only exploit a small part of it, making most of the goals pointless. Similarly to previous works dblpWardeFarleyWKI19, DisTop generates goals by sampling previously visited states. To sample the states, DisTop first samples a cluster, and then samples a state that belongs to the cluster.
Sampling a cluster:
To increase the entropy of states, DisTop samples goalstates with the skewed distribution defined in §3.2 that can be reformulated as:
(4) 
It is equivalent to sampling with a simple Boltzmann policy , using a novelty bonus reward and a fixed temperature parameter . In practice, we can use a different than in §3.2 to tradeoff the importance of the novelty reward in comparison with an extrinsic reward (see below) or decrease the speed at which we gather lowdensity states dblpPongDLNBL20.
We can add a second reward to modify our skewed policy so as to take into consideration extrinsic rewards. We associate to a cluster a value that represents the mean average extrinsic rewards received by the skills associated to its goals :
(5) 
The extrinsic value of a cluster is updated with an exponential moving average where is the learning rate. To favours exploration, we can also update the extrinsic value of the cluster’s neighbors with a slower learning rate (cf. Appendix D). Our sampling rule can then be :
(6) 
Sampling a state:
Once the cluster is selected, we keep exploring independently of the presence of extrinsic rewards. To approximate a uniform distribution over visited states that belongs to the cluster, DisTop samples a vector in the ball that surrounds the center of the cluster. It rejects the vector and samples another one if it does not belong to the cluster. Finally, it selects the closest embedded state to the vector.
3.4 Learning goalconditioned policies
We now briefly introduce the few mechanisms used to efficiently learn the goalconditioned policies. Once the goalstate is selected, the agent embeds it as . It uses the target weights to make our goalconditioned policy independent from fluctuations due to the learning process. In consequence, we avoid using a handengineered environmentspecific scheduling dblpPongDLNBL20. Our implementation of the goalconditioned policy is trained with Soft ActorCritic (SAC)haarnoja2018soft and the reward is the L2 distance between the state and the goal in the learnt representation. In practice, our goalconditioned policy needs a uniform distribution of goals to avoid forgetting previously learnt skills. Our representation function requires a uniform distribution over visited states to quickly learn the representation of novel areas. In consequence, DisTop samples a cluster and randomly takes half of the interactions from and the rest from . We can also sample a ratio of clusters with if we do not care about forgetting skills (cf. Appendix A). , and are learnt over the same minibatch and we relabel the goals extracted from (cf. Appendix D).
4 Experiments
Our experiments aim at emphasizing the genericity of DisTop. Firstly, unlike generative models in the ground state space, DisTop learns the true topology of the environment even though the ground state representation lacks structure. Secondly, when it does not access rewards, the discovered skills are as diverse as a SOTA algorithm (SkewFit dblpPongDLNBL20) on robotic arm environments using images. Thirdly, even though acquiring diverse skills may be useless, DisTop is still able to perform close to a SOTA algorithm (SAC haarnoja2018soft) in MuJoCo environments todorov2012mujoco. Finally, we compare DisTop with a SOTA hierarchical algorithm (LESSON li2021learning) in a sparse reward version of an agent that navigates in a maze nachum2019data
. All curves are a smooth averaged over 5 seeds, completed with a shaded area that represent the mean +/ the standard deviation over seeds. Used hyperparameters are described in Appendix
E, environments and evaluation protocols details are given in Appendix F. Videos and images are available in supplementary materials.Is DisTop able to learn diverse skills without rewards ?
We assess the diversity of learnt skills on two robotic tasks Visual Pusher (a robotic arm moves in 2D and eventually pushes a puck) and Visual Door (a robotic arm moves in 3D and can open a door) dblpNairPDBLL18. We compare to SkewFit dblpPongDLNBL20. These environments are particularly challenging since the agent has to learn from 48x48 images using continuous actions. We use the same evaluation protocol than SkewFit: a set of images are sampled, given to the representation function and the goalconditioned policy executes the skill. In Figure 3, we observe that skills of DisTop are learnt quicker on Visual Pusher but are slightly worst on Visual Door. Since the reward function is similar, we hypothesize that this is due to the structure of the learnt goal space. In practice we observed that DisTop is more stochastic than SkewFit since the intrinsic reward function is required to be smooth over consecutive states. Despite the stochasticity, it is able to reach the goal as evidenced by the minimal distance reached through the episode. Stochasticity does not bother evaluation on Visual Pusher since the arm moves elsewhere after pushing the puck in the right position.
Does DisTop discover the environment topology even though the ground state space is unstructured ?
In this experiment, we analyze the representation learnt by a standard Variational Auto Encoder (VAE) and DisTop. To make sure that the best representation is visualizable, we adapted the gymminigrid environment gym_minigrid (cf. Figure 2) where a randomly initialized agent moves for fifty timesteps in the four cardinal directions. Each observation is either a 900dimensional onehot vector with 1 at the position of the agent (binary) or the (x,y) coordinates of the agent. Interactions are given to the representation learning algorithm. Except for OEGN, we display as a node the learnt representation of each possible state and connect states that are reachable in one step. In Figure 2, we clearly see that DisTop is able to discover the topology of its environment since each connected points are distinct but close with each other. In contrast, the VAE representation collapses since it cannot take advantage of the proximity of states in the ground state space; it only learns a correct representation when it gets (x,y) positions, which are wellstructured. This toy visualization highlights that, unlike VAEbased models, DisTop does not depend on wellstructured ground state spaces to learn a suitable environment representation for learning skills.
Can DisTop solve nonhierarchical dense rewards tasks?
We test DisTop on MuJoCo environments Halfcheetahv2 and Antv2 todorov2012mujoco, the agent gets rewards to move forward as fast as possible. We fix the maximal number of timesteps to 300. We compare DisTop to our implementation of SAChaarnoja2018soft and to ELSIM dblpAubretMH20, a method that follows the same paradigm than DisTop (see §5). We obtain better results than in the original ELSIM paper using similar hyperparameters to DisTop. In Figure 3 (Right), we observe that DisTop obtains high extrinsic rewards and clearly outperforms ELSIM. It also outperforms SAC on Halfcheetahv2, but SAC still learns faster on Antv2. We highlight that SAC is specific to dense rewards settings and cannot learn in sparse settings (see below).
Does combining representation learning, entropy of states maximization and tasklearning improve exploration on highdimensional hierarchical tasks ?
We evaluate DisTop ability to explore and optimize rewards on imagebased versions of Ant Maze and Point Maze environments nachum2019data. The state is composed of a proprioceptive state of the agent and a top view of the maze ("Image"). In contrast with previous methods nachum2019data; li2021learning, we remove the implicit curriculum that changes the extrinsic goal across episodes; here we train only on the farthest goal and use a sparse reward function. Thus, the agent gets a nonnegative reward only when it gets close to the goal. We compare our method with a SOTA HRL method, LESSON li2021learning (see §5), ELSIM and our implementation of SAC haarnoja2018soft. For ELSIM, LESSON and DisTop, we only pass the "image" to the representation learning part of the algorithms, assuming that an agent can separate its proprioceptive state from the sensorial state. In Figure 4, we can see that DisTop is the only method that manages to return to the goal; in fact, looking at a learnt 3D OEGN network, we can see that it successfully represent the Ushape form of the maze and set goals close to the extrinsic goal. LESSON discovers the goal but does not learn to return to it^{1}^{1}1In Point Maze, the best seed of LESSON returns to the goal after 5 millions timesteps; we hypothesize that its highlevel policy takes too long to gather enough interactions to learn on and that it still requires too many random highlevel actions to find the goal. Neither SAC, nor ELSIM find the reward. We suppose that the undirected width expansion of the tree of ELSIM does not maximize the stateentropy, making it spend too much time in useless areas and thus inefficient for exploration. We admit that DisTop has high variance, but we argue that it is more a credit assignment issue sutton1999between due to the longterm reward than an exploration problem.
5 Related works
Intrinsic skills in hierarchical RL
Some works propose to learn hierarchical skills, but do not introduce a default behavior that maximizes the visited state entropy hazan2019provably, limiting the ability of an agent to explore. For example, it is possible to learn skills that target a ground state or a change in the ground state space nachum2019data; levy2018hierarchical. These approaches do not generalize well with highdimensional states. To address this, one may want to generate rewards with a learnt representation of goals.NOR nachum2018near bounds the suboptimality of such representation to solve a task and LESSON li2021learning
uses a slow dynamic heuristic to learn the representation. In fact, it uses an InfoNCElike objective function; this is similar to
dblpjournals/corr/abs191213414 which learns the representation during pretraining with random walks. DCO jinnai2019exploration generates optionsby approximating the second eigenfunction of the combinatorial graph Laplacian of the MDP. It extends previous works
dblpMachadoBB17; bar2020option to continuous state spaces. Abovementioned methods uses a hierarchical random walk to explore the environment, we have shown in §4 that DisTop explores quicker by maximizing the entropy of states in its topological representation.Intrinsic motivation to learn diverse skills.
DisTop simultaneously learns skills, their goal representation, and which skill to train on. It contrasts with several methods that exclusively focus on selecting which skill to train on assuming a good goal representation is available dblpFlorensaHGA18; dblpcurious; fournier2019clic; zhang2020automatic; dblpColasKLDMDO20. They either select goals according to a curriculum defined with intermediate difficulty and the learning progress oudeyer2009intrinsic or by imagining new languagebased goals dblpColasKLDMDO20. In addition, DisTop strives to learn either skills that are diverse or extrinsically rewarding. It differs from a set of prior methods that learn only diverse skills during a pretraining phase, preventing exploration for endtoend learning. Some of them maximize the mutual information (MI) between a set of states and skills. Typically, DIAYN dblpEysenbachGIL19, VALOR dblpjournals/corr/abs180710299 and SNN dblpFlorensaDA17 learn a discrete set of skills, but hardly generalize over skills. It has been further extended to continuous set of skills, using a generative model dblpSharmaGLKH20 or successor features dblpHansenDBWWM20; dblpBorsaBQMHMSS19. In both case, directly maximizing this MI may incite the agent to focus only on simple skills campos2020explore. DISCERN dblpWardeFarleyWKI19 maximizes the MI between a skill and the last state of an episode using a contrastive loss. Unlike us, they use the true goal to generate positive pairs and use a L2 distance over pixels to define a strategy that improves the diversity of skills. In addition, unlike VAEbased models, our method better scales to any ground state space (see §4). Typically, RIG dblpNairPDBLL18 uses a VAE dblpKingmaW13 to compute a goal representation before training the goalconditioned policies. Using a VAE, it is possible to define a frontier with a reachability network, from which the agent should start stochastic exploration bharadhwaj2020leaf; but the gradual extension of the frontier is not automatically discovered, unlike approaches that maximize the entropy of states (including DisTop). SkewFit dblpPongDLNBL20 further extended RIG to improve the diversity of learnt skills by making the VAE overweight lowdensity states. Unlike DisTop, it is unclear how SkewFit could target another distribution over states than a uniform one. Approaches based on learning progress (LP) have already been built over VAEs kovavc2020grimgep; laversanne2018curiosity; we believe that DisTop could make use of LP to avoid distractors or further improve skill selection.
Skill discovery for endtoend exploration.
Like DisTop, ELSIM dblpAubretMH20 can discover diverse and rewarding skills in an endtoend way. It builds a tree of skills and selects the branch to improve with extrinsic rewards. DisTop outperforms ELSIM for both dense and sparserewards settings (cf. §4). This endtoend setting has also been experimented through multigoal distribution matching pitis2020maximum; dblpleelisa where the agent tries to reduce the difference between the density of visited states and a given distribution (with highdensity in rewarding areas). Yet, either they approximate a distribution over the ground state space dblpleelisa or assume a wellstructured state representation pitis2020maximum. Similar wellstructured goal space is assumed when an agent maximizes the rewardweighted entropy of goals zhao2019maximum.
Dynamicaware representations.
A set of RL methods try to learn a topological map without addressing the problem of discovering new and rewarding skills. Some methods dblpSavinovDK18; dblpSavinovRVMPLG19; DblpEysenbachSL19 consider a topological map over direct observations, but to give flat intrinsic rewards or make planning possible. We emphasize that SFAGWRHRL zhou2019vision hierarchically takes advantage of a topological map built with two GWQ placed over two Slow Feature Analysis algorithms wiskott2002slow; it is unclear whether it can be applied to other environments than their robotic setting. Functional dynamicaware representations can be discovered by making the distance between two states match the expected difference of trajectories to go to the two states dblphoshGL19; interestingly, they exhibit the interest of topological representations for HRL and propose to use a fix number of clusters to create goals. Previous work also showed that an active dynamicaware search of independent factors can disentangle the controllable aspects of an environment dblpBengioTPPB17.
6 Conclusion
We introduced a new model, DisTop, that simultaneously learns a topology of its environment and the skills that navigate into it. In particular, it concentrates the skill discovery on parts of the topology with lowdensity embedded states until it discovers an extrinsic reward. In contrast with previous approaches, there is no pretraining pitis2020maximum; dblpPongDLNBL20, particular scheduling dblpPongDLNBL20 or random walks dblpjournals/corr/abs191213414; bharadhwaj2020leaf. It avoids catastrophic forgetting, does not need a wellstructured goal space pitis2020maximum or dense rewards li2021learning. Throughout all the experiments, we have shown that DisTop is generic : it keeps working and learn a dynamicaware representation in dense rewards setting, pixelbased skill discovery environments and hard sparse rewards settings. Yet, there are limitations and exciting perspectives: HRL dblphoshGL19 and planning based approaches dblpNasirianyPLL19 could both take advantage of the topology and make easier states discovery; Frontierbased exploration bharadhwaj2020leaf could also be explored to reduce skill stochasticity. Disentangling the topology bengio2013representation and learning progress measures kovavc2020grimgep
could further improve the scalability of the approach. Finally, we argue that, by tackling the problem of unsupervised learning of both dynamicaware representations and skills without adding constraints in comparison to standard RL, DisTop opens the path towards more humanlike openended DRL agents.
References
Appendix A Ablation study
In this section, we study the impact of the different key hyperparameters of DisTop. Except for paper results, we average the results over 3 random seeds. For visualizations based on the topology, the protocol is the same as the one described §4. In addition, we select the most viewable one among 3 learnt topologies; while we did not observe variance in our analysis through the seeds, the 3D angle of rotation can bother our perception of the topology.
Details about maze experiments.
Controlling the distortion of the representation.
The analysis of our objective function shares some similarities with standard works on contrastive learning chopra2005learning. However, we review it to clarify the role of the representation with respect to the interactions, the reward function and OEGN.
In Figure 6, we can see that the distortion parameter rules the global dilatation of the learnt representation. A low also increases the distortion of the representation, which may hurt the quality of the clustering algorithm. is competing with , the temperature hyperparameter. As we can see in Figure 8, rules the minimal allowed distance between very different states. So that there is a tradeoff between a low that distorts and makes the representation unstable, and a high that allows different states to be close with each other. With a high , the L2 rewards may admit several local optimas.
In Figure 7, we see that also impacts the distortion of the representation, however it mostly limits the compression of the representation in the areas the agent often interacts in. In the borders, the agent often hurts the wall and stays in the same position. So in comparison with large rooms, the bring together is less important than the move away part. limits such asymmetries and keeps large rooms dilated. Overall, this asymmetry also occurs when the number of states increases due to the exploration of the agent: the agent progressively compresses its representation since interesting negative samples are less frequent.
The size of the clusters of OEGN has to match the distortion of the representation. Figure 9 emphasizes the importance of the creation threshold parameter . If this is too low (
), clusters do not move and a lot of clusters are created. OEGN waits a long time to delete them. If it is too high, the approximation of the topology becomes very rough, and states that belong to very different parts of the topology are classified in the same cluster; this may hurt our density approximation.
Selection of interactions:
Figure 10 shows the importance of the different hyperparameters that rule the sampling of goals and states. In Ant environment, we see that the agent has to sample a small ratio of learning interactions from rather than ; it speeds up its learning process by making it focus on important interactions. Otherwise, it is hard to learn all skills at the same time. However, the learning process becomes unstable if it deterministically samples from a very small set of clusters (ratios and ).
Then we evaluate the importance of on Visual Pusher. We see that the agent learns quicker with a low . It hardly learns when the highlevel policy almost does not oversample lowdensity clusters (). This makes our results consistent with the analysis provided in the paper of SkewFit dblpPongDLNBL20.
In the last two graphics of Figure 10, we show the impact of ; we observe that the agent learns quicker when is close to . It highlights that the agent quicker learns both a good representation and novel skills by sampling uniformly over the clusters.
Overall, we also observe that some seeds become unstable. Since they coincide with large deletions of clusters. We expect that an hyperparameter search on the delete operators of OEGN may solve this issue in these specific cases.
Appendix B OEGN and GWQ
Several methods introduced unsupervised neural networks to learn a discrete representation of a fixed input distribution kohonen1990self; fritzke1995growing; MarslandSN02. The objective is to have clusters (or nodes) that cover the space defined by the input points. Each node has a reference vector which represents its position in the input space. We are interested in the Growing when required algorithm (GWQ) MarslandSN02 since it updates the structure of the network (creation and deletion of edges and nodes) and does not impose a frequency of node addition. OEGN is an adaptation of GWQ; algorithm 1 describes the core of these algorithms and their specific operators are described below. Specific operations of OEGN are colored in red, the ones of GWQ in blue and the common parts are black. Our modifications are made to 1 make the network aware of the dynamics; 2 adapt the network to the dynamically changing true distribution.
Delete operator:
Delete if is above a threshold to verify that the node is still active; delete the less filled neighbors of two neighbors if their distance is below to avoid too much overlapping; check it has been selected times before deleting it. Delete if a node does not have edges anymore.
Both creation and moving operators:
Check that the distance between the original goal and the resulting embedding is below a threshold .
Creation operator:
Check if . Verify has already been selected times by . If the conditions are filled, a new node is created at the position of the incoming input and it is connected to . Update a firing count of the winning node and its neighbors and check it is below a threshold. A node is created halfway between the winning node and the second closest node. The two winning nodes are connected to the new node.
Moving operator:
If no node is created or deleted, which happens most of the time, we apply the moving operator described by eq. 7 assuming . In our case, we use a very low to avoid the nodes to concentrate close to highdensity areas.
(7) 
Edge operator:
edges are added or updated with attribute between the winning node of and the one of . Edges with are added or updated between the two closest nodes of . When an edge is added or updated, increment the age of the other neighbors of and delete edges that are above a threshold .
Except for and , we emphasize that thresholds parameters have not been finetuned and are common to all experiments; we give them in Table 1.
Parameters  Value 

Age deletion threshold  
Error deletion count  
Proximity deletion threshold  
Count creation threshold  
Minimal number of selection  
Learning rate  
Neighbors learning rate  
Number of updates per batch 
Appendix C Derivation of the contrastive loss
(8)  
(9)  
(10) 
In the last line of eq. 10, we upperbound with 1 since when is positive. The logarithmic function is monotonic, so the negative logarithm inverses the bound.
Appendix D Implementation details
In this section, we provide some details about DisTop.
Number of negative samples.
In practice, to learn we do not consider the whole batch of states as negative samples. For each positive pair, we randomly sample only 10 states within the batch of states. This number has not been finetuned.
Relabeling strategies.
We propose three relabeling strategies;

relabelling: we take samples from and relabel the goal using clusters sampled with and randomly sampled states. This is interesting when the agent focuses on a task; this gives more importance to rewarding clusters and allows to forget uninteresting skills. We use it in Ant and HalfCheetah environments.

Uniform relabelling: we take samples from and relabel the goal using states sampled from from . When , this is equivalent to relabeling uniformly over the embedded state space. This is used for maze environments and Visual Door.

Topological relabelling: we take samples from both and and relabel each goal with a state that belongs to a neighboring cluster. This is interesting when the topology is very large, making an uniform relabelling inefficient. This is applied in Visual Pusher experiments, but we found it to work well in mazes and Visual Door environments.
d.1 Comparison methods
Lesson:
we used the code provided by the authors and reproduced some of their results in dense rewards settings. Since the environments are similar, we used the same hyperparameter as in the original paper li2021learning.
SkewFit:
since we use the same evaluation protocol, we directly used the results of the paper. In order to fairly compare DisTop to SkewFit, the state given to the DRL policy of DisTop is also the embedding of the true image. We do not do this in other environments. We also use the exact same convolutional neural network (CNN) architecture for weights
as in the original paper of SkewFit. It results that our CNN is composed of three convolutional layers with kernel sizes: 5x5, 3x3, and 3x3; number of channels: 16, 32, and 64; strides: 3, 2 and 2. Finally, there is a last linear layer with neurons that corresponds to the topology dimensions
. This latent dimension is different from the ones of SkewFit, but this is not tuned and set to 10.Elsim:
we use the code provided by the authors. We noticed they used very small batch sizes and few updates, so we changed the hyperparameters and get better results than in the paper on HalfCheetah. We set the batch size to and use neural networks with hidden layers. The weight decay of the discriminator is set to in the maze environment and in Ant and HalfCheetah. In Ant and HalfCheetah, the learning process was too slow since the agent sequentially runs up to 15 neural networks to compute the intrinsic reward; so we divided the number of updates by two. In our results, it did not bring significant changes.
Sac:
we made our own implementation of SAC. We made a hyperparameter search on entropy scale, batch size and neural networks structure. Our results are consistent with the results from the original paper haarnoja2018soft.
Appendix E Hyperparameters
Table 2 shows the hyperparameters used in our main experiments. We emphasize that tasks are very heterogeneous and we did not try to homogenize hyperparameters across environments.
Parameters  Values  Comments 
DRL algorithm SAC  
Entropy scale  
Hidden layers  
Number of neurons  Smaller may work  
Learning rate  RP: else  Works with both 
Batch size  RP: else  Works with both 
Smooth update  RP: else  Works with both. 
Discount factor  Tuned for mazes  
Representation  
Learns on  No/No/No/No/Yes/Yes  Works with both 
Learning rate  , MA: , MC:  Not tuned on MA, MC 
Number of neurons  256 except robotic images  Not tuned 
Hidden layers  2 except robotic images  Not tuned 
Distortion threshold  SPM: else  Tuned on SPM 
Distortion coefficient  20  See Appendix A 
Consistency coefficient  RD: else  Not tuned 
Smooth update  0.001  Not tuned 
Temperature  See Appendix A  
Topology dimensions  Not tuned  
OEGN and sampling  
Creation threshold  RP: else 0.6  See Appendix A 
Success threshold  
Buffers size  
Skew sampling  RD: else  See Appendix A 
updates per steps  
Highlevel policy  
Learning rate  0.05  Tuned 
Neighbors learning rate  Not finetuned  
Skew selection  See Appendix A  
Reward temperature 
Appendix F Environment details
f.1 Robotic environments
Environments and protocols are as described in dblpPongDLNBL20. For convenience, we sum up again some details here.
Visual Door:
a MuJoCo environment where a robotic arm must open a door placed on a table to a target angle. The state space is composed of 48x48 images and the action space is a move of the end effector (at the end of the arm) into (x,y,z) directions. Each direction ranges in the interval [1,1]. The agent only resets during evaluation in a random state. During evaluation, goalstates are sampled from a set of images and given to the goalconditioned policy. At the end of the 100steps episode, we measure the distance between the final angle of the door and the angle of the door in the goal image.
Visual Pusher:
a MuJoCo environment where a robotic arm has to push a puck on a table. The state space is composed of 48x48 images and the action space is a move of the end effector (at the end of the arm) in (x,y) direction. Each direction ranges in the interval [1,1]. The agent resets in a fixed state every 50 steps. During evaluation, goalstates are sampled randomly in the set of possible goals. At the end of the episode, we measure the distance between the final puck position and the puck position in the goal image.
f.2 Maze environments
These environments are described in nachum2018near and we used the code modified by li2021learning. For convenience, we provide again some details and explain our sparse version. The environment is composed of 8x8x8 fixed blocks that confine the agent in a Ushaped corridor displayed in Figure 4.
Similarly to li2021learning, we zeroout the (x,y) coordinates and append a lowresolution top view of the maze to the proprioceptive state. This "image" is a 75dimensional vector. In our sparse version, the agent gets 0 reward when its distance to the target position is below 1.5 and gets 1 reward otherwise. The fixed goal is set at the topleft part of the maze.
Sparse Point Maze:
the proprioceptive state is composed of 4 dimensions and its 2dimensional action space ranges in the intervals [1,1] for forward/backward movements and [0.25,0.25] for rotation movements.
Sparse Ant Maze:
the proprioceptive state is composed of 27 dimensions and its 8dimension action space ranges in the intervals [16,16].
Appendix G Computational resources
Each simulation runs on one GPU during 20 to 40 hours according to the environment. Here are the settings we used:

Nvidia Tesla K80, 4 CPU cores from of a Xeon E52640v3, 32G of RAM.

Nvidia Tesla V100, 4 CPU cores from a Xeon Silver 4114, 32G of RAM.

Nvidia Tesla V100 SXM2, 4 CPU cores from a Intel Cascade Lake 6226 processors, 48G of RAM. (Least used).
Appendix H Example of skills
Figures 11, 12, 13 show examples skills learnt in respectively Visual Door, Visual Pusher and Ant Maze. Additional videos of skills are available in supplementary materials. We also provide videos of the topology building process in maze environments. We only display it in maze environments since the 3Dtopology is suitable.
Comments
There are no comments yet.