Large Scale Environment Partitioning in Mobile Robotics Recognition Tasks

In this paper we present a scalable machine learning approach to mobile robot visual localization. The applicability of machine learning approaches is constrained by the complexity and size of the problem's domain. Thus, dividing the problem becomes necessary and two essential questions arise: which partition set is optimal for the problem, and how to integrate the separate results into a single solution. The novelty of this work is the use of Information Theory for partitioning high-dimensional data. In the presented experiments the domain of the problem is a large sequence of omnidirectional images, each of them providing a high number of features. A robot which follows the same trajectory has to determine which image from the sequence is the most similar to its current view. The sequence is divided so that each partition is suitable for building a simple classifier. The partitions are established on the basis of the information divergence peaks among the images. Measuring the divergence has usually been considered unfeasible in high-dimensional data spaces. We overcome this problem by estimating the Jensen-Rényi divergence with an entropy approximation based on entropic spanning graphs. Finally, the responses of the different classifiers provide a multimodal hypothesis for each incoming image. As the robot moves, a particle filter is used to converge to a unimodal hypothesis.


I. INTRODUCTION
Mobile robotics is a field with an increasing number of applications, and the range of possible environments for a robot is becoming wider. The capability of learning without the need of human aid is becoming a basic aspect of mobile robotics. Among the variety of machine learning applications to mobile robotics, an essential one is localization. The localization problem consists of estimating the position of the robot in a given map, based on the information provided by the sensors of the robot, such as cameras and range sensors, as well as the odometry of the robot, if available. This work presents a novel approach to image-similarity-based localization. The main purpose of the method is scalability to large and complex environments, and this aspect is dealt with by dividing the domain into as many partitions as needed.
Cameras provide rich information, and its correct interpretation is a complex and open problem. There are two well differentiated approaches to visual recognition: the structural-description models and the image-based models. The first one is usually more complex and task-specific. Image-based recognition, also known as appearance-based recognition, relies on general features extracted from the image, ignoring the structures. In robotics this approach has already been used for localization. For instance, in [1] Menegatti et al. weight omnidirectional samples according to image similarity, to implement a Monte Carlo localization for constrained indoor environments. This technique is used for managing multimodal probability densities, for example when the current image matches more than one reference image. Still, the complexity and size of the environment restrict the applicability of this method, because the image matching relies on a single similarity function of the Fourier coefficients of the reference images. Any classifier has a limited capacity which limits its scalability to deal with more complex patterns [2].
In this work we propose an unsupervised division of the sequence of reference images into several subsequences of images. With each one of the partitions we associate a similarity measure which yields an appropriate image retrieval result. We take as a similarity measure the Euclidean distance in a suitable feature space which is automatically selected from a general set of low-level features. Two major questions arise from this approach: which division to perform on the sequence and, given a test image, which one of the several similarity measures to consider. For dealing with the first problem we make use of Information Theory. We estimate the local Jensen-Rényi divergence [3], [4] between the previous and the next images of the sequence, and the divergence peaks determine the limits between two consecutive partitions. The Jensen-Rényi divergence is estimated in the feature space of the images, where the features are a set of low-level filters, similarly to a previous work [5]. The second question we have to deal with is how to put together the results generated by the different similarity measures. This problem is tackled with a particle filter [6], provided that the robot moves in some direction along the trajectory. In this experimental setup we assume that the robot will follow the same trajectory along which the reference images were taken. This assumption is inspired by a previous work [7] in which a robot performs vision-based navigation along corridor-like environments.
The rest of the paper is organized as follows. In Section II we explain the setup of the experiments. In Section III we present our information theoretic approach to data partitioning. Next, in Section IV we explain how we obtain a similarity measure for each different partition. Then in Section V we present a way to put the results together and we show some results. In Section VI we present our conclusions and future work.

II. EXPERIMENTAL SETUP
The experimental procedure for the presented system consists of making a camera-equipped mobile robot or a person follow some definite path or trajectory, in order to record video or a dense sequence of images. We call this sequence the reference images because localization is performed with respect to them. In the presented experiments the sequence consists of 440 images taken along a 180 m long indoor/outdoor route, which is shown in Fig. 1. Once the images are collected, the system performs an unsupervised partition of the sequence. Then, a feature selection process is performed for each partition, in order to optimize each similarity function for its associated interval of images. The bank of filters used for selection consists of rotation invariant low-level filters such as edge detectors and color filters. Rotation invariance is possible because the camera we use is omnidirectional and is vertically oriented, as shown in Fig. 2.
The initial set of features F consists of the responses of each image to a bank of low-level filters. The filters are applied to 6 different scales of the image. Moreover, the omnidirectional image is divided into 4 concentric rings and the filters are also applied separately to each individual ring. This ring division and the whole feature extraction process are explained in more detail in [5]. Therefore the filter bank F consists of |F| = 17 × 6 × 5 = 510 features, that is, 17 filters at 6 scales over 5 image regions (the whole image plus the 4 rings). Each image represents a point in this feature space, which is invariant to rotation and is much less sensitive to small image variations than the raw pixel space. The comparisons between images are performed in this feature space or a subset of it, as detailed in Section IV. After the learning phase, we perform tests which consist of starting from some random position in the reference trajectory and moving in some direction, which can be forwards or backwards with respect to the direction of the reference trajectory. Localization along a trajectory makes sense in environments where the robot follows only forward/backward paths, like corridors, streets or avenues. Some navigation methods have already been developed for such situations. Concretely, [7] describes a previous work in which corridor-following navigation is performed using the same vision sensor that we use in this work. As the robot navigates, it takes test images at a fixed time interval. The first test image taken causes a multimodal response. The next images make the system converge to a unique hypothesis for the position of the robot, with an error of ±2 reference images. In the presented experiments about 10-15 images are necessary for achieving convergence. For larger sequences convergence to a unimodal hypothesis would be slower, while the error would remain low.

III. DATA PARTITIONING

A. Motivation
Dividing the data into several partitions is a key to scalability. In this work we have to establish a similarity measure capable of indicating the most similar reference image to a new input image. This measure is formulated in terms of a distance between images in an appropriate feature space, further explained in Section IV. This formulation is equivalent to a K-nearest neighbour (K-NN) classifier where each reference pattern is a separate class. In the Machine Learning field it is well known that the capacity of a classifier is in a tradeoff with its generalization properties [8]. In other words, a large amount of data requires a high capacity, which results in an inaccurate classification of test patterns (those which do not belong to the training set or reference patterns). To avoid a decrease in classification accuracy as the amount of training data increases, we divide the data into as many partitions as needed. On the other hand, a formulation with a single classifier is unable to handle the perceptual aliasing problem, which refers to different states producing similar sensor responses. For example, a long trajectory could include corridors with identical appearance. In these circumstances the only way to know the correct location is to take into account the previous images, which is explained in detail in Section V. In the following subsection we explain the criterion we propose for data partitioning.

B. Partitioning
Finding the optimal partitioning of the sequence of images would involve evaluating every possible configuration of subsequences. As the optimal number of subsequences is also unknown, there are $\sum_{k=0}^{N} \binom{N}{k} = 2^N$ possible partition set configurations for an amount of $N$ reference images. In the presented experiments $N = 440$, which means that there are $2.8392 \cdot 10^{132}$ combinations. The complexity of the problem makes it necessary to use some heuristics for finding a good partition set instead of the optimal one.
The heuristic we propose for this problem is based on Information Theory. The idea is to find discontinuities in the sequence of reference images in the feature space F. The reason for this is that each partition has its own associated similarity measure, based on a subset of those features. For example, when the trajectory leaves the laboratory and the corridor begins, there is a significant discontinuity which we are interested in. Information Theory offers tools for finding such interesting places in terms of entropy analysis. Entropy analysis allows us to measure changes in the amount of information in the data, regardless of the nature of the data. An example can be seen in Fig. 3. In this example a single feature is analyzed in order to show the idea of entropy divergence. However, we do not analyze single features; instead we estimate the divergence in the whole feature space, as explained in the following subsection.
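The size of this search space can be checked with a few lines of Python. This is only an illustrative sketch: the function name `count_partition_configurations` is ours and not part of the presented system.

```python
import math


def count_partition_configurations(n: int) -> int:
    """Sum of the binomial coefficients C(n, k) for k = 0..n."""
    return sum(math.comb(n, k) for k in range(n + 1))


total = count_partition_configurations(440)
# The sum of all binomial coefficients of order n equals 2**n.
print(total == 2 ** 440)
# Scientific notation with four decimals, for comparison with the text.
print(f"{total:.4e}")
```

Exact integer arithmetic is used on purpose, since the count is far beyond the range in which floating point sums of binomial coefficients stay exact.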

C. Jensen-Rényi divergence
In order to calculate the entropy divergence over the whole feature space of the images we need a way of estimating entropy in high-dimensional spaces. Traditionally this has been a drawback, and entropy has usually been estimated in one or two dimensions because of the high computational complexity and estimation errors in spaces with more dimensions. However, there are methods for entropy estimation which do not depend on the number of dimensions of the data. A widely used one is the estimation of Rényi entropy with entropic spanning graphs, described by Hero and Michel in [9]. In [10] we explain the estimation of Rényi entropy and its further approximation to the Shannon entropy for calculating the Mutual Information with a feature selection purpose. Also, in [11] entropic spanning graphs are used for Mutual Information estimation for image registration. In this work we use the Jensen-Rényi divergence, which is a more general criterion than Mutual Information, as shown in [3].
In [4] Hamza and Krim explain the properties of the divergence and show an example of edge detection for image segmentation. Although the Jensen-Rényi divergence can measure the divergence among several sets of data, a simpler way to detect edges is to use a sliding window. This window is split into two subwindows $W_1$ and $W_2$ and their divergence is evaluated, as represented in Fig. 4. The general formula of the divergence for $n$ probability distributions $p_1, p_2, \dots, p_n$ is defined as:

$$JR_\alpha^{\omega}(p_1, \dots, p_n) = H_\alpha\left(\sum_{i=1}^{n} \omega_i p_i\right) - \sum_{i=1}^{n} \omega_i H_\alpha(p_i) \tag{1}$$

where $H_\alpha(p)$ is the Rényi entropy, $\alpha$ is the Rényi entropy's parameter, and $\omega = (\omega_1, \omega_2, \dots, \omega_n)$ is a weight vector which satisfies $\sum_{i=1}^{n} \omega_i = 1$ and $\omega_i \geq 0$. In the case of the sliding window the probability distributions are $p_1$, corresponding to the subwindow $W_1$, and $p_2$, corresponding to $W_2$. Setting equal weights $\omega_i$ reduces the formula of the window's divergence to:

$$JR_\alpha(p_1, p_2) = H_\alpha\left(\frac{p_1 + p_2}{2}\right) - \frac{H_\alpha(p_1) + H_\alpha(p_2)}{2} \tag{2}$$

The Rényi entropy (also known as α-entropy) of a probability density function $p(\vec{x})$ is defined as:

$$H_\alpha(p) = \frac{1}{1 - \alpha} \ln \int p^{\alpha}(\vec{x}) \, d\vec{x} \tag{3}$$

and it can be estimated by building the minimal spanning tree $MST(\{\vec{x}_i\})$ of the data $\vec{x}$ and computing the weighted length of its edges $e$:

$$L_\gamma(\{\vec{x}_i\}) = \sum_{e \in MST(\{\vec{x}_i\})} |e|^{\gamma}, \qquad \gamma \in (0, d) \tag{4}$$

where $d$ is the number of dimensions of the data. The following Rényi entropy estimator is asymptotically stable and consistent for $d \geq 2$, as shown in [12]:

$$\hat{H}_\alpha(\{\vec{x}_i\}) = \frac{d}{\gamma} \left[ \ln \frac{L_\gamma(\{\vec{x}_i\})}{N^{\alpha}} - \ln \beta_{L_\gamma, d} \right] \tag{5}$$

where $\gamma = d(1 - \alpha)$, $N$ and $d$ are the number of samples and the number of dimensions of the data $\vec{x}$, and $\beta_{L_\gamma, d}$ is a constant which does not depend on the probability function but on the graph minimization criterion. An approximation [13] that can be used for large $d$ is $\beta_{L_\gamma, d} \approx \frac{\gamma}{2} \ln \frac{d}{2 \pi e}$. Finally, even though the $\alpha$ parameter is fundamental in the α-entropy, it does not have a significant effect on the entropy divergence (Eq. 2). In our experiments $\alpha = 0.8$.
To sum up, we can perform entropy divergence analysis as in Fig. 3 using a sliding window and calculating the divergence with Eq. 2. The approximation in Eq. 5 makes it possible to work with a large number of dimensions, which in our case is $d = |F| = 510$ features. A question that arises now is: which window size is the most appropriate? It depends on the environment and on the distance between the images. However, a multiscale analysis shows that the discontinuities of interest remain at several window sizes, while those which do not interest us get displaced or disappear. See Fig. 5, where the divergence at 30 window sizes is represented. It is easily observed that at the images 241, 254 and 273 there are gradients in the divergence at all the scales, while the other gradients get diagonally displaced. Therefore the criterion that we establish for dividing the sequence into partitions is the presence of strong peaks in the gradient function, for those gradients which are present at different scales. The result of partitioning the whole sequence of 440 images is shown in Fig. 6, where the multiscale divergence gradient is represented and the sharpest peaks are selected. There are 19 important discontinuities, so 20 partitions are established. In the next Section we explain how we perform localization for each single partition.
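As an illustration of the estimation chain described above (minimal spanning tree, weighted edge length, Rényi entropy estimate, equal-weight window divergence), the following Python sketch may be helpful. It is not the implementation used in the experiments. The names `mst_weighted_length`, `renyi_entropy_mst` and `window_divergence` are hypothetical, the data are synthetic 2D points, and the constant term ln β is exposed as a parameter `log_beta` left at zero, under the observation that with equal weights it cancels in the window divergence.

```python
import math
import random


def mst_weighted_length(points, gamma):
    """Sum of |e|**gamma over the edges of the Euclidean MST (Prim's algorithm)."""
    n = len(points)
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known distance from each point to the tree
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u] ** gamma  # the root contributes 0
        for v in range(n):
            if not in_tree[v]:
                dist = math.dist(points[u], points[v])
                if dist < best[v]:
                    best[v] = dist
    return total


def renyi_entropy_mst(points, alpha=0.8, log_beta=0.0):
    """MST-based entropy estimate: (d / gamma) * (ln(L / N**alpha) - ln(beta))."""
    n, d = len(points), len(points[0])
    gamma = d * (1.0 - alpha)
    length = mst_weighted_length(points, gamma)
    return (d / gamma) * (math.log(length / n ** alpha) - log_beta)


def window_divergence(window, alpha=0.8):
    """Equal-weight divergence between the two halves of a window of samples."""
    half = len(window) // 2
    w1, w2 = window[:half], window[half:]
    return renyi_entropy_mst(window, alpha) - 0.5 * (
        renyi_entropy_mst(w1, alpha) + renyi_entropy_mst(w2, alpha)
    )


rng = random.Random(0)
same = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
change = same[:100] + [(rng.gauss(20, 1), rng.gauss(20, 1)) for _ in range(100)]
print("homogeneous window:", window_divergence(same))
print("window across a change:", window_divergence(change))
```

The window that straddles the change would be expected to yield the larger value. Note that for a feature space as large as the one used here, the power `dist ** gamma` may overflow or underflow in floating point, so a practical version would probably need to work with logarithms of the edge lengths.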
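The multiscale selection criterion can be sketched as follows. This is a simplified illustration on toy curves rather than the procedure used to produce Fig. 6: the functions `gradient_peaks` and `persistent_discontinuities`, as well as the parameters `threshold`, `tolerance` and `min_fraction`, are assumptions introduced only for the example.

```python
def gradient_peaks(curve, threshold):
    """Indices i where |curve[i+1] - curve[i]| is a local maximum above threshold."""
    grad = [abs(b - a) for a, b in zip(curve, curve[1:])]
    peaks = []
    for i in range(1, len(grad) - 1):
        if grad[i] >= threshold and grad[i] > grad[i - 1] and grad[i] >= grad[i + 1]:
            peaks.append(i)
    return peaks


def persistent_discontinuities(curves, threshold, tolerance=1, min_fraction=0.8):
    """Keep the gradient peaks that reappear (within tolerance) at most scales."""
    per_scale = [gradient_peaks(curve, threshold) for curve in curves]
    candidates = sorted({p for peaks in per_scale for p in peaks})
    kept = []
    for c in candidates:
        votes = sum(any(abs(p - c) <= tolerance for p in peaks) for peaks in per_scale)
        far_from_last = not kept or c - kept[-1] > tolerance
        if votes >= min_fraction * len(curves) and far_from_last:
            kept.append(c)
    return kept


# Toy divergence curves at three window sizes: the jump after index 3 persists,
# while the second jump is displaced as the window grows.
curves = [
    [0, 0, 0, 0, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3],
    [0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3],
    [0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3],
]
print(persistent_discontinuities(curves, threshold=0.5))  # [3]
```

Here an index i denotes the boundary between samples i and i+1. Only the discontinuity that is present at all three scales is kept, which mirrors the behaviour described for images 241, 254 and 273.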

IV. LOCALIZATION IN EACH PARTITION
This Section explains how we obtain a similarity function which is adequate for a definite subsequence of reference images and how we use it for new images. Given a sequence of $N$ images $I = (I_1, \dots, I_N)$ which are associated to the linear positions along the trajectory $S = (s_1, \dots, s_N)$, and given a test image $I_T$ associated to a position $s_T$, the objective is to find an image similarity measure $M(I_i, I_T)$ which minimizes the error between the real position of the test image and the estimated position:

$$\min_{M} \left| s_T - s_{\hat{\imath}} \right| \tag{6}$$

where $\hat{\imath}$ is the index of the reference image $I_{\hat{\imath}}$ which is the most similar (has the shortest distance) to $I_T$:

$$\hat{\imath} = \arg\min_{i} M(I_i, I_T) \tag{7}$$

Provided that the images are in a $D$-dimensional feature space $F = \{F_1, \dots, F_D\}$ (already explained in Section II), we define the dissimilarity measure as the weighted Euclidean distance in $F$:

$$M_\omega(I_i, I_T) = \sqrt{\sum_{k=1}^{D} \omega_k \left( F_k(I_i) - F_k(I_T) \right)^2} \tag{8}$$

where $F_k(I)$ denotes the response of image $I$ to the $k$-th feature and the weights $\omega = (\omega_1, \dots, \omega_D)$, $\omega_k \in \{0, 1\}$, determine which features are considered and which are not. These weights have to be set so as to minimize the objective defined in Eq. 6.
Fig. 8. Responses of 20 different similarity measures, each one trained for a particular subsequence of images. Note the coherence between test and reference images in the diagonal of the plot. Each test image produces 20 different hypotheses; however, only one of them is coherent with the previous test images, if they are taken in a sequential order.

In order to achieve a good generalization for new images, the maximum possible number of test images has to be considered. According to the definition of the problem (Section II) we have at our disposal only one sequence of $N$ images for the training process. Therefore we have to separate it into a training set and a test set. The Leave One Out Cross Validation (LOOCV) evaluation method maximizes the number of tests. With this procedure the sequence is divided into a training set of $N - 1$ images and a test set of 1 image; the evaluation is repeated $N$ times until every image in the sequence has played the test role. A greedy algorithm then iteratively selects the important features, one at a time.
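A minimal sketch of the retrieval step defined by Eq. 7 and Eq. 8 is given below, assuming that each image is already represented by a list of feature responses. The function names `weighted_distance` and `most_similar_index` and the toy data are hypothetical.

```python
import math


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance; a weight of 0 discards the feature."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def most_similar_index(references, test, weights):
    """Index of the reference feature vector closest to the test vector."""
    return min(
        range(len(references)),
        key=lambda i: weighted_distance(references[i], test, weights),
    )


# Three reference images with three features each; the third feature is
# uninformative for this toy subsequence.
references = [[0.0, 0.0, 5.0], [1.0, 1.0, 0.0], [2.0, 2.0, 9.0]]
test = [1.1, 0.9, 8.0]
print(most_similar_index(references, test, [1, 1, 1]))  # 2
print(most_similar_index(references, test, [1, 1, 0]))  # 1
```

With all features enabled the uninformative third feature dominates the distance, whereas masking it recovers the reference image that agrees on the first two features. This is the effect that the binary weights are meant to capture.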
The algorithm does not have a stopping criterion.Instead, it keeps on selecting features until all of them (D) are selected, and stores them in order, together with their associated classification errors.Finally, it returns the weights vector which has the lowest associated error.
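The selection procedure described in this Section could be sketched as follows. This is a reconstruction from the textual description only, not the original listing: the error is taken here to be the mean absolute position error of the nearest neighbour under LOOCV, ties are broken by the lowest index, and the names `loocv_error` and `greedy_feature_selection` are hypothetical.

```python
import math


def loocv_error(features, positions, weights):
    """Mean position error when each image is matched against all the others."""
    n = len(features)
    total = 0.0
    for i in range(n):
        nearest = min(
            (j for j in range(n) if j != i),
            key=lambda j: sum(
                w * (a - b) ** 2 for w, a, b in zip(weights, features[i], features[j])
            ),
        )
        total += abs(positions[i] - positions[nearest])
    return total / n


def greedy_feature_selection(features, positions):
    """Add one feature per iteration until all are selected; keep the best subset."""
    d = len(features[0])
    weights = [0] * d
    best_weights, best_error = None, math.inf
    history = []  # (selected feature, LOOCV error) in selection order
    for _ in range(d):
        trial_errors = {}
        for k in range(d):
            if weights[k] == 0:
                weights[k] = 1
                trial_errors[k] = loocv_error(features, positions, weights)
                weights[k] = 0
        k_best = min(trial_errors, key=trial_errors.get)
        weights[k_best] = 1
        history.append((k_best, trial_errors[k_best]))
        if trial_errors[k_best] < best_error:
            best_error = trial_errors[k_best]
            best_weights = list(weights)
    return best_weights, best_error, history


# Feature 0 grows with the position; feature 1 is unrelated to it.
positions = [0, 1, 2, 3, 4, 5]
features = [[0.0, 3], [1.0, 9], [2.0, 1], [3.0, 8], [4.0, 2], [5.0, 7]]
best_weights, best_error, history = greedy_feature_selection(features, positions)
print(best_weights, best_error)  # [1, 0] 1.0
```

In this toy case adding the second feature raises the error, so the returned weight vector keeps only the first one, even though the loop continues until every feature has been ranked.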
Once the feature selection process has been performed, the dissimilarity measure for the set of images $I$ is applied according to Eq. 8 and the weights $\omega$. When a new image arrives, it is said to be closest to some reference image $I_{\hat{\imath}}$, where $\hat{\imath}$ is the index of the reference image which has the shortest distance $M_\omega$ to the new image, as expressed in Eq. 7. Fig. 8 represents the estimations of the 20 different similarity measures for 220 test images which have not been used for the training process. It can be observed that for a single test image each similarity function has a different response, because each one is trained for a different subsequence of images. In the following Section we explain how we put these results together to obtain a single estimation.

V. LOCALIZATION IN THE WHOLE DOMAIN
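In mobile robot localization a very important source of information is the history of previous perceptions of the robot, as well as odometry, if available. When a new perception produces multiple hypotheses, as shown in Fig. 8, the history can help to disambiguate them and converge to a unimodal hypothesis. We tackle the problem with the classical Monte Carlo Localization (MCL), also known as particle filter. This is a Bayesian approach which aims to estimate recursively the posterior distribution $p(s_{T_0}, s_{T_1}, \dots, s_{T_k} \mid I_{T_0}, I_{T_1}, \dots, I_{T_k})$, where $s_{T_0}, s_{T_1}, \dots, s_{T_k}$ are a sequence of hidden parameters and $I_{T_0}, I_{T_1}, \dots, I_{T_k}$ are the sequence of test images observed. In particular, we are interested in a marginal distribution of the posterior, which is called the filtering distribution and is denoted as $p(s_{T_k} \mid I_{T_0}, I_{T_1}, \dots, I_{T_k})$. From this distribution we can obtain the posterior mode of the state, which in our case is the estimated position with respect to the reference images. This representation is approximate but it is nonparametric, which makes it possible to represent a wide range of distributions.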
The MCL algorithm is described and analyzed in [6]. It samples the posterior distribution with a set of particles which correspond to positions in the trajectory. We take as many particles as similarity functions we have; however, this number could be changed dynamically according to the complexity of the distribution. For each newly obtained test image, the algorithm first resamples each particle according to a motion model. In our case the motion model is a bimodal distribution with modes at ±1 image and a variance of 2 images, because we assume that the robot moves and takes test images at the same speed as in the training run, and we assume that it can move both forward and backward along the defined trajectory. We do not use odometry in the experiments we present.
Once the particle positions have been updated, their weights have to be calculated according to their likelihood. The likelihood is the probability of the perception, given the particle position. For example, see Fig. 8 and let us say that a particle was sampled at the position 400, which lies in the last interval of the sequence. Then the likelihood will be determined by the response of the similarity function denoted in the plot as H1, which is $\arg\min_j M_\omega(I_{T_k}, I_j)$. It can be seen in the plot that if the perception $I_{T_k}$ actually came from the image #200 of the test set, then $M_\omega(I_{200}, I_{400}) = 0$, so $j = 400$, which is the same value as the particle position, and this would be the maximum possible likelihood. Otherwise, let us say that the perception came from the image #190; then, according to the same similarity function, $j = 380$ and the likelihood is lower. Finally, if the perception came from some test image number lower than #170, the response would be rather random, as shown in the plot. In this case, even if we obtain a high likelihood for the particle in this iteration, in the following iterations it would become low because the new perceptions would be incoherent with the motion model.
Once the weights of the particles have been calculated, the algorithm performs an importance sampling of the particles. As many particles as needed (a constant number in our implementation) are sampled according to the likelihood of each one of the previously existing particles. This means that particles with a low likelihood will probably disappear, while locations with a high density of particles with a good likelihood will cause a higher concentration of particles. This can be observed in Fig. 9.
The fact that zones with a high likelihood attract the rest of the particles can mislead the algorithm into converging to some incorrect location. When this happens, after a few observations the likelihood of the particles becomes very low. To get over such a situation a simple mechanism is introduced, which takes some particles with a low likelihood and places them randomly along the whole trajectory space. The number of randomized particles depends on the mean likelihood of all the particles. This way, if all the particles have a high likelihood, few particles are randomized. This mechanism is also useful for the kidnapped robot problem, which consists of taking the robot from its location and moving it to a different one. Fig. 10 shows the performance of a single similarity function over the whole space of images, in contrast to the use of several functions (Fig. 8). It can be seen that the discontinuities present in a single similarity function are overcome with the partitioning approach.
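One iteration of the localization loop (motion update, likelihood weighting and importance sampling) can be sketched as below. Several elements are assumptions made for the sake of a runnable example: the Gaussian form of `likelihood` and its `sigma`, the small floor that avoids all-zero weights, the toy partition boundaries, the number of particles, and the simulated classifier responses. The text does not specify the exact likelihood function.

```python
import bisect
import math
import random


def partition_of(position, boundaries):
    """Index of the partition containing position (boundaries = sorted start indices)."""
    return bisect.bisect_right(boundaries, position) - 1


def motion_update(position, n_refs, rng):
    """Bimodal motion model: a step centred on +1 or -1 image, with variance 2."""
    direction = rng.choice((-1, 1))
    moved = position + rng.gauss(direction, math.sqrt(2.0))
    return min(max(moved, 0.0), n_refs - 1.0)


def likelihood(position, response, sigma=2.0):
    """Assumed Gaussian agreement between a particle and its classifier's response."""
    value = math.exp(-0.5 * ((position - response) / sigma) ** 2)
    return max(value, 1e-12)  # floor so that the weights never sum to zero


def mcl_step(particles, responses, boundaries, n_refs, rng):
    """Move every particle, weight it, and resample a constant number of particles."""
    moved = [motion_update(p, n_refs, rng) for p in particles]
    weights = [likelihood(p, responses[partition_of(p, boundaries)]) for p in moved]
    return rng.choices(moved, weights=weights, k=len(moved))


rng = random.Random(1)
n_refs, boundaries = 300, [0, 100, 200]
particles = [rng.uniform(0, n_refs - 1) for _ in range(200)]
for t in range(12):
    true_position = 150 + t
    # Simulated responses: only the classifier of the middle partition is coherent.
    responses = [rng.randrange(0, 100), true_position, rng.randrange(200, 300)]
    particles = mcl_step(particles, responses, boundaries, n_refs, rng)
print("median particle position:", sorted(particles)[len(particles) // 2])
```

Each particle is weighted only by the response of the similarity function of the partition in which it lies, which is how the multimodal responses of Fig. 8 enter the filter in this sketch.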
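The randomization mechanism could take a form such as the following sketch. The linear rule that maps the mean likelihood to the number of relocated particles, the parameter `max_fraction` and the function name `randomize_weak_particles` are assumptions, since the text only states that the number depends on the mean likelihood. Likelihoods are assumed to lie in [0, 1].

```python
import random


def randomize_weak_particles(particles, likelihoods, n_refs, rng, max_fraction=0.5):
    """Relocate the weakest particles uniformly; fewer when the mean likelihood is high."""
    mean_likelihood = sum(likelihoods) / len(likelihoods)
    n_random = round(max_fraction * (1.0 - mean_likelihood) * len(particles))
    weakest_first = sorted(range(len(particles)), key=likelihoods.__getitem__)
    result = list(particles)
    for i in weakest_first[:n_random]:
        result[i] = rng.uniform(0, n_refs - 1)
    return result


rng = random.Random(0)
particles = [10.0, 11.0, 12.0, 13.0, 200.0, 201.0, 202.0, 203.0]
likelihoods = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
# Mean likelihood 0.5 -> round(0.5 * 0.5 * 8) = 2 particles are relocated.
print(randomize_weak_particles(particles, likelihoods, 300, rng))
```

When every particle has a likelihood of 1 nothing is relocated, and when all likelihoods collapse towards 0 up to `max_fraction` of the particles are spread again over the trajectory, which is what allows recovery after a wrong convergence or a kidnapping.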

VI. CONCLUSION AND FUTURE WORK
In this work we present a scalable visual localization approach which allows an autonomous robot to localize itself over large trajectories. It is an approach based on machine learning, and no assumptions are made about the training images. Only one training run over the trajectory is needed for the camera to collect all the reference images. We propose to use an information divergence measure for finding interesting places and partitioning the data. Then we propose to use simple specialized classifiers for each different partition. The off-line learning process for the presented experiment took about 10 minutes, both for partitioning and for selecting the features for the different classifiers. Finally, a unique localization estimation is obtained with a Monte Carlo sampling method. Classification of each new image can be performed online, as its feature extraction process takes a few hundredths of a second and the evaluation of the similarity measures is very fast. The results are promising, as shown by the experiment with an indoor/outdoor trajectory 180 m long, in which a unimodal hypothesis is usually achieved before the first 12 observations. As future work we plan to extend this approach to 2D localization and to the construction of, and localization in, topological maps. One of the contributions useful for topological navigation is finding interesting places in a quite data-independent way. These interesting places could be the nodes of a topological map, where a decision about which way to follow could be taken.

Fig. 1. Representation of the trajectory followed by the robot for taking the reference images. Sample omnidirectional images are shown. The route consists of a laboratory, a narrow corridor, a wide corridor, stairs, a hall, two short outdoor segments and a large one. The 3D representation is courtesy of Juan Manuel Saez (University of Alicante).

Fig. 2. The omnidirectional mirror is oriented to capture the ground and 360° of the surroundings. This orientation allows rotation invariance when rotating about the vertical axis. The omnidirectional mirror is Remote Reality's OneShot360 and the video capturing device (mounted on the base of the lens) is a Point Grey Flea2 FireWire camera.

Fig. 3. Entropy divergence analysis of a single feature: the Nitzberg filter for the reference images 1-80. It can be observed that the filter response does not present any significant maxima or minima. However, there is a change in the variability near image #30. The entropy divergence measure lets us notice the change by presenting a high gradient at that point. [Plot labels: Region A, Region B.]

Fig. 5. Top: A representation of the Jensen-Rényi divergence at 30 different scales for some images from the sequence. We can see that the X positions of some discontinuities get displaced at different scales (the scales refer to the window size) while some others persist. We are interested in the latter ones. Bottom: three sample images corresponding to the persistent discontinuities of the upper plot.

Fig. 6. The gradient of the divergence along the whole sequence of reference images. Some of the peaks are selected as significant discontinuities according to their sharpness.

Fig. 7. Left: The reference images which are selected as discontinuities are marked with a red circle. Blue boxes are an analogy with Fig. 1. Right: Some sample images. The third one corresponds to stairs, which cause several discontinuities near to each other.

Fig. 9. A Monte Carlo Localization trace of iterations 1, 5, 8 and 11. With the 11th test image the convergence to unimodality is achieved. The real position of the test images is represented with a red circle, while the particles are represented with blue asterisks. The likelihood of each particle is also represented.

Fig. 10. Top: The response of a single similarity function trained for the whole sequence of images. Bottom: Localization results at the test image #131, given by: a) Single similarity measure for the whole set of images; the response is marked with a red cross. b) Monte Carlo Localization response; the particles are represented with vertical lines. c) Ground truth, represented with a red circle. It can be observed that for the single similarity measure the localization result is far from the ground truth, due to the discontinuity it presents at that point. However, the MCL, which uses several similarity measures (Fig. 8), is closer to the ground truth.