We use a vector implementation of capsules (
Sabour et al., 2017) where the magnitude of the vector represents the existence of the visual entity and the orientation characterizes its visual properties. The primary-level capsules are generated through a linear readout of the encoder RNN,
\(h^{enc}_t\). These capsules are meant to represent lower-level visual entities (“parts”) that belong to one of the higher-level capsules in the object capsule layer (“whole”). To find this part–whole relationship, we used the dynamic routing algorithm proposed by
Sabour et al. (2017). Dynamic routing is an iterative process in which the assignments of parts to wholes (the coupling coefficients) are progressively determined by the agreement between the two capsules, measured by the dot product of their vector representations. Each primary-level capsule (\(i\)) provides a prediction for each object-level capsule (\(j\)). These predictions are then combined using the coupling coefficients (\(c^{ij}_t\)) to compute the object-level capsule. The agreement (dot product) between each object-level capsule and the predictions from the primary-level capsules then updates the coupling coefficients for the next routing step. For example, if the prediction for digit capsule
\(j\) from a primary capsule \(i\) (\(\hat{p}^{j|i}_t \leftarrow W^{ij}_t p^{i}_t\)) agrees strongly with the computed digit capsule \((\sum _{i}c^{ij}_t\hat{p}^{j|i}_t)\), the coupling coefficient \(c^{ij}_t\) is increased so that more information is routed from primary capsule \(i\) to object capsule \(j\). Coupling coefficients are normalized across the class-capsule dimension following the max–min normalization (
Zhao, Kleinhans, Sandhu, Patel, & Unnikrishnan, 2019) as in
Equation 1. Lower and upper bounds for normalization,
lb and
ub, were set to 0.01 and 1.0. This routing procedure iterates three times. We used this method instead of the softmax normalization in
Sabour et al. (2017) because we observed that the latter method failed to differentiate between the coupling coefficients. In our experiments, we used 40 primary-level capsules, each a vector of size 8. The object capsules are vectors of size 16; there are 10 of them, corresponding to the 10 digit categories, for the multiobject recognition task and 4 of them for the visual reasoning task. For the object-level capsules, we use a squash function (Equation 2) to ensure that each vector's magnitude lies within the range of 0 to 1. For the multiobject recognition task, these magnitudes represent the probability of a digit being present in the glimpse at each step. Once routing is completed, we compute the vector magnitude (L2 norm) of each object capsule to obtain classification scores. The final digit classification is predicted from the scores accumulated over all timesteps. For the visual reasoning task, two capsules (among four total) were designated as response capsules, and the cumulative magnitude of these capsules was used for predicting the same–different responses.
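A minimal sketch of this score accumulation, assuming object capsules are stored per timestep as NumPy arrays (the function name and array shapes are illustrative, not from the paper):

```python
import numpy as np

def classification_scores(object_caps_per_step):
    """Accumulate L2 norms of object capsules over timesteps.

    object_caps_per_step: sequence of arrays, one per glimpse, each of
    shape (num_classes, capsule_dim) -- illustrative shapes, assumed here.
    Returns the per-class scores summed over all timesteps; the final
    digit prediction is the argmax of these accumulated scores.
    """
    return sum(np.linalg.norm(v, axis=-1) for v in object_caps_per_step)
```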
\begin{equation}
c^{ij}_t = lb + (ub-lb)\,\frac{c^{ij}_t-\min _{j}(c^{ij}_t)}{\max _{j}(c^{ij}_t)-\min _{j}(c^{ij}_t)}
\end{equation}
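As a sketch, Equation 1 can be implemented as below; the function name and the small epsilon guarding against a zero range are our additions, assuming the coupling coefficients are stored as a (num_primary, num_object) array:

```python
import numpy as np

def max_min_normalize(c, lb=0.01, ub=1.0, eps=1e-9):
    """Max-min normalization of coupling coefficients (Equation 1).

    c: array of shape (num_primary, num_object). Normalization runs
    across the class-capsule dimension (axis 1) for each primary capsule.
    eps is a numerical safeguard not present in the equation itself.
    """
    c_min = c.min(axis=1, keepdims=True)
    c_max = c.max(axis=1, keepdims=True)
    return lb + (ub - lb) * (c - c_min) / (c_max - c_min + eps)
```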
\begin{equation}
d^{j}_t = \frac{\Vert v^j_t\Vert ^2}{1+\Vert v^j_t\Vert ^2}\frac{v^j_t}{\Vert v^j_t\Vert }
\end{equation}
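Putting the pieces together, a minimal NumPy sketch of the three-iteration routing loop with the squash of Equation 2 might look as follows; the array shapes, function names, and epsilon terms are our assumptions, not the authors' implementation:

```python
import numpy as np

def squash(v, eps=1e-9):
    """Equation 2: scale each capsule vector so its magnitude lies in (0, 1)."""
    norm_sq = np.sum(v ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * (v / np.sqrt(norm_sq + eps))

def dynamic_routing(p_hat, n_iters=3, lb=0.01, ub=1.0, eps=1e-9):
    """Route primary-capsule predictions to object capsules.

    p_hat: predictions, shape (num_primary, num_object, dim), where
    p_hat[i, j] is primary capsule i's prediction for object capsule j.
    Returns the object capsules v (num_object, dim) and the final
    coupling coefficients c (num_primary, num_object).
    """
    b = np.zeros(p_hat.shape[:2])  # raw agreement accumulators
    for _ in range(n_iters):
        # Max-min normalization across the class-capsule dimension (Eq. 1);
        # on the first pass all entries are equal, so c starts at lb.
        b_min = b.min(axis=1, keepdims=True)
        rng = b.max(axis=1, keepdims=True) - b_min
        c = lb + (ub - lb) * (b - b_min) / (rng + eps)
        v = squash(np.sum(c[..., None] * p_hat, axis=0))  # object capsules
        b = b + np.einsum('ijd,jd->ij', p_hat, v)  # dot-product agreement
    return v, c
```

Classification scores would then be the magnitudes `np.linalg.norm(v, axis=-1)`, accumulated over timesteps as described above.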