1 Introduction
Deep Metric Learning (DML) is an important yet challenging topic in the Computer Vision community, with numerous applications such as visual product search [15, 18], multimodal retrieval [1, 31], face verification and clustering [22], and person or vehicle identification [14, 38]. To deal with such applications, a DML method aims to learn an embedding space where all visually-related images (e.g., images of the same car model) are close to each other and dissimilar ones (e.g., images of two cars from the same brand but from different models) are far apart.

Recent contributions in DML can be divided into three categories. A first category includes methods that focus on batch construction to maximize the number of pairs or triplets available to compute the similarity (e.g., N-pair loss [23]). A second category involves the design of loss functions that improve generalization (e.g., binomial deviance [26]). The third category covers ensemble methods that tackle the embedding space diversity (e.g., BIER [19]).

The similarity metric is trained jointly with the image representation, which is computed using deep neural network architectures such as GoogleNet [25] or BN-Inception [8]. For all of these networks, the image representation is obtained by aggregating the deep features using Global Average Pooling [37]. Hence, the deep features are summarized by their sample mean, and the training process ensures that this sample mean is discriminative enough for the target task.

Our insight is that ignoring the characteristics of the deep feature distribution leads to a lack of distinctiveness in the deep features. We illustrate this phenomenon in Figure 4. In Figure 4(a), we train a DML model on MNIST and plot both the deep features and the image representations for a set of images sampled from the training set. We observe that the representations are perfectly organized, while the deep features are, in contrast, scattered across the entire space. As the representations are obtained using the sample mean only, they are sensitive to outliers or sampling problems (occlusions, illumination, background variation, etc.), which we refer to as the scattering problem. We illustrate this problem in Figure 4(b), where the representations are computed using the same architecture but by sampling only a fraction of the original deep features. As we can see, the resulting representations are no longer correctly organized.
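To make the aggregation step concrete, here is a small NumPy sketch (our own illustration, not code from the paper) of Global Average Pooling as a sample mean over a synthetic feature map, together with the sensitivity of that mean when only a subsample of the deep features is observed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic deep feature map: h x w spatial positions, d channels.
h, w, d = 7, 7, 16
features = rng.normal(size=(h, w, d))

# Global Average Pooling: the representation is the sample mean
# of the h*w deep features.
representation = features.reshape(-1, d).mean(axis=0)

# Representation computed from a random subsample of the positions
# (mimicking occlusions or background variation).
idx = rng.choice(h * w, size=(h * w) // 4, replace=False)
subsampled = features.reshape(-1, d)[idx].mean(axis=0)

# The two representations differ: the sample mean is sensitive
# to which deep features happen to be observed.
drift = np.linalg.norm(representation - subsampled)
print(drift > 0)
```

The drift between the two means is exactly the sensitivity that HORDE aims to reduce by constraining the whole deep feature distribution rather than its mean alone.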
In this paper, we propose HORDE, a High-Order Regularizer for Deep Embeddings which tackles this scattering problem. By minimizing (resp. maximizing) the distance between high-order moments of the deep feature distributions, this DML regularizer enforces deep feature distributions from similar (resp. dissimilar) images to be nearly identical (resp. to not overlap). As illustrated in Figure 4(c), our HORDE regularizer produces well-localized features, leading to robust image representations even if they are computed using only a fraction of the original deep features.

Our contributions are the following: First, we propose a High-Order Regularizer for Deep Embeddings (HORDE) that reduces the scattering problem and allows the sample mean to be a robust representation. We provide a theoretical analysis in which we support this claim by showing that HORDE is a lower bound of the Wasserstein distance between the deep feature distributions while also being an upper bound of their Maximum Mean Discrepancy. Second, we show that HORDE consistently improves DML with varying loss functions, even when considering ensemble methods. Using HORDE, we obtain state-of-the-art results on four standard DML datasets (CUB-200-2011 [27], Cars-196 [12], In-Shop Clothes Retrieval [15] and Stanford Online Products [18]).
The remainder of this paper is organized as follows: In section 2, we review recent works on deep metric learning and how our approach differs. In section 3, after an overview of our proposed method, we present the practical implementation of HORDE as well as a theoretical analysis. In section 4, we compare our proposed architecture with the state-of-the-art on four image retrieval datasets, and we show the benefit of HORDE regularization for different loss functions and an ensemble method. In section 5, we conduct extensive experiments to demonstrate the robustness of our regularization and its statistical consistency.

2 Related Work
In DML, we jointly learn the image representations and an embedding in such a way that the Euclidean distance is consistent with the semantic content of the images. Current approaches use a pre-trained CNN to produce deep features, then aggregate these features using Global Average Pooling [37]. Finally, they learn the target representation with a linear projection. The whole network is fine-tuned to solve the metric learning task according to three criteria: a loss function, a sampling strategy and an ensemble method.
Regarding the loss function, popular approaches consider pairs [3] or triplets [22] of similar/dissimilar samples. Recent works generalize these loss functions to larger tuples [2, 18, 23, 26] or improve their design [28, 29, 34]. The sampling of the training tuples receives plenty of attention [18, 22, 23], either through mining [7, 22], proxy-based approximations [16, 17] or hard negative generation [4, 13]. Finally, ensemble methods have recently become an increasingly popular way of improving the performance of DML architectures [11, 19, 33, 35]. Our proposed HORDE regularizer is a complementary approach: we show in section 4 that it consistently improves these popular DML models.
Recent approaches also consider a distribution analysis for DML [21, 13]. In contrast to our approach, they only consider the distribution of the representations to design a loss function or a hard negative generator, but they do not take into account the distribution of the underlying deep features. Consequently, they do not address the scattering problem. More precisely, the Magnet loss [21] proposes to better represent a given class manifold by learning a multi-modal distribution instead of the standard unimodal assumption. To that aim, the per-class distribution is approximated using K-means clustering. The proposed loss tries to minimize the distance between a representation and its nearest class mode while maximizing the distance to all modes of all other classes. However, since the Magnet loss is directly applied to the sample means of the deep features, it leads to the scattering problem illustrated in Figure 4. In DVML [13], the authors assume that the representations follow a per-class Gaussian distribution. They propose to estimate the parameters of these distributions using a variational auto-encoder approach. Then, by sampling from a Gaussian distribution with the learned parameters, they are able to generate artificial hard samples to train the network. However, no assumption is made on the distribution of the deep features, which leads to the scattering problem illustrated in Figure 4 (see also [13], Figure 1). In contrast, we show that focusing on the distribution of the deep features reduces the scattering problem and improves the performance of DML architectures.

In the next section, we first give an overview of the proposed HORDE regularization. Then, we describe the practical implementation of the high-order moments computation. Finally, we give theoretical insights which support the regularization effect of HORDE.
3 Proposed High-Order Regularizer
We first give an overview of the proposed method in Figure 5. We start by extracting a deep feature map of size h × w × d using a CNN, where h and w are the height and width of the feature map and d is the deep feature dimension. Following standard DML practices, these features are aggregated using Global Average Pooling to build the image representation and are projected into an embedding space before a similarity-based loss function is computed over these representations (top-right blue box in Figure 5).
In HORDE, we directly optimize the distribution of the deep features by minimizing (respectively maximizing) a distance between the deep feature distributions of similar images (respectively dissimilar images). We approximate this distance between deep feature distributions by computing high-order moments (bottom-right red box in Figure 5). We recursively approximate the high-order moments and compute an embedding after each of these approximations. Then, we apply a DML loss function on each of these embeddings.
3.1 High-order computation
In practice, the computation of high-order moments is very intensive due to their high dimension. Furthermore, it has been shown in [9, 19] that an independence assumption over all high-order moment components is unrealistic. Hence, we rely on factorization schemes to approximate their computation, such as Random Maclaurin (RM) [10]. The RM algorithm relies on a set of random projectors to approximate the inner product between two high-order moments. In the case of the second order, we sample two independent random vectors $w_1, w_2 \in \mathbb{R}^d$ whose entries follow a uniform distribution on $\{-1, +1\}$. For two non-random vectors $x$ and $y$, the inner product between their second-order moments can be approximated as:

$\langle x \otimes x, y \otimes y \rangle = \mathbb{E}_{w_1, w_2}\big[(w_1^\top x)(w_2^\top x)(w_1^\top y)(w_2^\top y)\big]$ (1)
where $\otimes$ is the Kronecker product and $\mathbb{E}_{w_1, w_2}$ is the expectation over the random vectors $w_1$ and $w_2$, which follow the distribution described above. This approach readily extends to estimate any inner product between $k$-th moments:

$\langle x^{\otimes k}, y^{\otimes k} \rangle = \mathbb{E}\big[\phi(x)\,\phi(y)\big]$ (2)

where $\phi(x)$ is computed as:

$\phi(x) = \prod_{i=1}^{k} w_i^\top x$ (3)
In practice, we approximate the expectation of this quantity by using the sample mean over $D$ sets of these random projectors. That is, we sample $k$ independent random matrices $W_1, \dots, W_k \in \{-1, +1\}^{D \times d}$ and we compute the vector $\phi_k(x)$ that approximates the $k$-th moments of $x$ with the following equation:

$\phi_k(x) = \frac{1}{\sqrt{D}} (W_1 x) \circ (W_2 x) \circ \dots \circ (W_k x)$ (4)

where $\circ$ is the Hadamard (element-wise) product. Thus, the inner product between the $k$-th moments is:

$\langle x^{\otimes k}, y^{\otimes k} \rangle \approx \langle \phi_k(x), \phi_k(y) \rangle$ (5)
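As a sanity check (our own illustration, not code from the paper), the second-order identity of Equation 1 can be verified numerically with ±1 projectors: the Monte Carlo estimate converges to $(x^\top y)^2$, which equals the inner product of the second-order moments:

```python
import numpy as np

rng = np.random.default_rng(42)

d, D = 4, 400_000  # feature dimension, number of random projector pairs
x = np.array([1.0, 0.5, -0.2, 0.3])
y = np.array([0.8, -0.1, 0.4, 0.2])

# Two independent sets of random +/-1 projectors (Random Maclaurin).
W1 = rng.choice([-1.0, 1.0], size=(D, d))
W2 = rng.choice([-1.0, 1.0], size=(D, d))

# Monte Carlo estimate of E[(w1.x)(w2.x)(w1.y)(w2.y)].
estimate = np.mean((W1 @ x) * (W2 @ x) * (W1 @ y) * (W2 @ y))

# Exact value of the inner product of second-order moments: (x.y)^2.
exact = (x @ y) ** 2
print(abs(estimate - exact))  # small Monte Carlo error
```

With a finite number of projector pairs the estimate is noisy, which is precisely why the paper averages over $D$ sets of projectors in Equation 4.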
However, Random Maclaurin produces a consistent estimator independently of the analyzed distributions, and thus also encodes non-informative high-order moment components. To ignore these non-informative components, the projectors can be learned from the data. However, the high number of parameters in the matrices $W_i$ makes it difficult to learn a consistent estimator, as we empirically show in subsection 5.2. We solve this problem by computing the high-order moment approximation using the following recursion:

$\phi_{k+1}(x) = (W_{k+1} x) \circ \phi_k(x)$ (6)
This last equation leads to the proposed cascaded architecture for HORDE, summarized in Algorithm 1. We empirically show in subsection 5.2 that this recursive approach produces a consistent estimator of the informative high-order moment components.
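A minimal sketch of this cascade (ours; fixed random projectors stand in for the learned matrices $W_i$) shows that the recursion of Equation 6 reproduces the direct Random Maclaurin product of Equation 4:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, K = 16, 32, 5  # feature dim, projection dim, highest order

# One projection matrix per order; in HORDE these are learned,
# here they are random stand-ins.
Ws = [rng.choice([-1.0, 1.0], size=(D, d)) for _ in range(K)]
x = rng.normal(size=d)

# Cascaded computation: phi_{k+1}(x) = (W_{k+1} x) o phi_k(x).
phis = []
phi = np.ones(D) / np.sqrt(D)
for W in Ws:
    phi = (W @ x) * phi
    phis.append(phi)

# Direct computation of the K-th order approximation (Equation 4).
direct = np.ones(D) / np.sqrt(D)
for W in Ws:
    direct = direct * (W @ x)

print(np.allclose(phis[-1], direct))  # True
```

The cascade yields every intermediate order for free, which is what allows one embedding (and one loss term) per order.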
Then, the HORDE regularizer consists in computing a DML-like loss function on each of the high-order moments, such that similar (respectively dissimilar) images have similar (respectively dissimilar) high-order moments:

$\ell_{\mathrm{HORDE}}(I_1, I_2) = \sum_{k=2}^{K} \ell\big( \mathbb{E}_{x \sim I_1}[\phi_k(x)],\; \mathbb{E}_{y \sim I_2}[\phi_k(y)] \big)$ (7)
In practice, we cannot compute the expectation since the distribution of the deep features is unknown. We propose to estimate it using the empirical estimator:

$\mathbb{E}_{x \sim I_1}[\phi_k(x)] \approx \frac{1}{|X|} \sum_{x \in X} \phi_k(x)$ (8)

where $X$ and $Y$ are the sets of deep features extracted from images $I_1$ and $I_2$. Hence, the DML model is trained on a combination of a standard DML loss $\ell$ and the HORDE regularizer on pairs of images $I_1$ and $I_2$:
$\mathcal{L}(I_1, I_2) = \ell(I_1, I_2) + \ell_{\mathrm{HORDE}}(I_1, I_2)$ (9)

This can easily be extended to any tuple-based loss function. In practice, we use the same DML loss function for HORDE.
Remark also that at inference time, the image representation consists only of the sample mean of the deep features:

$r = \frac{1}{|X|} \sum_{x \in X} x$ (10)

and the HORDE part of the model can be discarded.
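Putting Equations 6 to 9 together, the following sketch (ours; a squared distance stands in for the actual DML loss $\ell$, and random matrices stand in for the learned $W_i$) computes the empirical moment embeddings of two deep feature sets and a HORDE-style regularization term:

```python
import numpy as np

rng = np.random.default_rng(1)
d, D, K = 16, 32, 4
Ws = [rng.choice([-1.0, 1.0], size=(D, d)) for _ in range(K)]

def moment_embeddings(features):
    """Empirical estimates of E[phi_k(x)] for k = 1..K (Equation 8),
    computed with the cascade of Equation 6."""
    phi = np.ones((len(features), D)) / np.sqrt(D)
    out = []
    for W in Ws:
        phi = (features @ W.T) * phi
        out.append(phi.mean(axis=0))
    return out

# Deep feature sets of two images (stand-ins for CNN feature maps).
X = rng.normal(size=(49, d))
Y = rng.normal(loc=0.5, size=(49, d))

# HORDE-style regularizer: one loss term per order (squared-distance
# stand-in for the DML loss of Equation 7).
loss_horde = sum(np.sum((mx - my) ** 2)
                 for mx, my in zip(moment_embeddings(X), moment_embeddings(Y)))
print(loss_horde > 0)
```

At inference, only the plain mean `X.mean(axis=0)` would be kept as the representation (Equation 10); the moment branch exists solely for the training signal.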
3.2 Theoretical analysis
In this section, we show that optimizing distances between high-order moments is directly related to the Maximum Mean Discrepancy (MMD) [6] and the Wasserstein distance. We consider the Reproducing Kernel Hilbert Space (RKHS) $\mathcal{H}$ associated with distributions defined on a compact set $\Omega$, endowed with the Gaussian kernel $k(x, y) = e^{-\|x - y\|^2 / 2\sigma^2}$. An image is then represented as a distribution $p$ from which we can sample a set of deep features $X$. We denote $\mathbb{E}_p[\cdot]$ the expectation over $x$ sampled from $p$. The high-order moments are denoted using their vectorized forms, that is $x^{(k)} = \mathrm{vec}(x^{\otimes k})$ where $x^{\otimes 2} = x \otimes x$, etc. By extension, we use $x^{(1)} = x$ for the mean. We assume that all moments exist for every distribution in $\mathcal{H}$ and we note, for all distributions $p, q$:

$d_k(p, q) = \big\| \mathbb{E}_p[x^{(k)}] - \mathbb{E}_q[y^{(k)}] \big\|$ (11)
Following [6], the MMD between two distributions $p$ and $q$ is expressed as:

$\mathrm{MMD}(p, q) = \sup_{\|f\|_{\mathcal{H}} \leq 1} \big( \mathbb{E}_p[f(x)] - \mathbb{E}_q[f(y)] \big)$ (12)

The MMD searches for a transform that maximizes the difference between the expectations of two distributions. Intuitively, a low MMD implies that both distributions are concentrated in the same regions of the feature space.
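For intuition (our own illustration, using the standard biased empirical estimator of the squared MMD rather than anything defined in the paper), the MMD with a Gaussian kernel can be estimated from samples; overlapping sample sets yield a much smaller value than well-separated ones:

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Pairwise Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

def mmd2(X, Y, sigma=1.0):
    """Biased empirical estimate of the squared MMD between samples X and Y."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y_close = rng.normal(size=(200, 2))         # same distribution
Y_far = rng.normal(loc=3.0, size=(200, 2))  # shifted distribution

print(mmd2(X, Y_close) < mmd2(X, Y_far))  # True
```

This is the quantity that Theorem 1 below relates, from above, to distances between high-order moments.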
In the following theorem, we show that the distance over high-order moments is an upper bound of the squared MMD (the proof mainly follows [6]):

Theorem 1.

There exists a sequence $(\lambda_k)_{k \geq 0}$ such that, for every pair of distributions $p, q \in \mathcal{H}$, the MMD is bounded from above by the first $K$ moments of $p$ and $q$:

$\mathrm{MMD}^2(p, q) \leq \sum_{k=0}^{K} \lambda_k\, d_k(p, q)^2 + \varepsilon_K$ (13)

where $\varepsilon_K \to 0$ as $K \to \infty$.
Proof.

As the MMD is a distance on the RKHS [6], the square of the MMD can be rewritten as:

$\mathrm{MMD}^2(p, q) = \big\| \mathbb{E}_p[\psi(x)] - \mathbb{E}_q[\psi(y)] \big\|_{\mathcal{H}}^2$ (14)

where $\psi$ is defined using the kernel trick $k(x, y) = \langle \psi(x), \psi(y) \rangle$. Then, we can approximate the Gaussian kernel using its Taylor expansion:

$k(x, y) = e^{-\|x\|^2 / 2\sigma^2}\, e^{-\|y\|^2 / 2\sigma^2} \sum_{k=0}^{\infty} \lambda_k\, \langle x^{(k)}, y^{(k)} \rangle$ (15)

where $\lambda_k = \frac{1}{\sigma^{2k} k!}$. Thus, we can define $\psi$ as the direct sum of all weighted and vectorized moments:

$\psi(x) = e^{-\|x\|^2 / 2\sigma^2} \bigoplus_{k=0}^{\infty} \sqrt{\lambda_k}\, x^{(k)}$ (16)

As all moments exist, we can swap the expectation and the direct sum. Moreover, since the sequence $\lambda_k \to 0$ when $k \to \infty$ and the moments are bounded (the distributions are supported on the compact set $\Omega$), the higher-order moment contributions become negligible compared to the first $K$ moments. Thus, we have:

$\mathrm{MMD}^2(p, q) \leq \sum_{k=0}^{K} \lambda_k\, d_k(p, q)^2 + \varepsilon_K$ (17)

where $\varepsilon_K \to 0$ as $K \to \infty$. ∎
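The Taylor expansion underlying the proof can be checked numerically (our own sanity check, not part of the paper): truncating the factorized series for the Gaussian kernel at order K gives an error that vanishes as K grows:

```python
import math
import numpy as np

x = np.array([0.3, -0.2, 0.5])
y = np.array([0.1, 0.4, -0.3])
sigma = 1.0

# Exact Gaussian kernel value.
exact = math.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))

def truncated(K):
    """Order-K truncation of Equation 15:
    exp(-|x|^2/2s^2) exp(-|y|^2/2s^2) * sum_k (x.y)^k / (s^(2k) k!)."""
    prefix = (math.exp(-(x @ x) / (2 * sigma**2))
              * math.exp(-(y @ y) / (2 * sigma**2)))
    series = sum((x @ y) ** k / (sigma ** (2 * k) * math.factorial(k))
                 for k in range(K + 1))
    return prefix * series

errors = [abs(truncated(K) - exact) for K in (1, 3, 5)]
print(errors[0] > errors[1] > errors[2])  # the truncation error shrinks
```

The factorial decay of the coefficients is what makes the contribution of orders beyond K negligible, as used in the last step of the proof.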
Table 1: Comparison with the state-of-the-art on CUB-200-2011 and Cars-196 (Recall@K, %).

CUB-200-2011  Cars-196
Backbone  R@  1  2  4  8  16  32  1  2  4  8  16  32 
Loss functions or mining strategies  
GoogleNet  Angular loss [29]  54.7  66.3  76.0  83.9      71.4  81.4  87.5  92.1     
HDML [36]  53.7  65.7  76.7  85.7      79.1  87.1  92.1  95.5      
DAML-RMM [32]  55.1  66.5  76.8  85.3      73.5  82.6  89.1  93.5      
DVML [13]  52.7  65.1  75.5  84.3      82.0  88.4  93.3  96.3      
HTL [5]  57.1  68.8  78.7  86.5  92.5  95.5  81.4  88.0  92.7  95.7  97.4  99.0  
contrastive loss (Ours)  55.0  67.9  78.5  86.2  92.2  96.0  72.2  81.3  88.1  92.6  95.6  97.8  
contrastive loss + HORDE  57.1  69.7  79.2  87.4  92.8  96.3  76.2  85.2  90.8  95.0  97.2  98.8  
Triplet loss (Ours)  50.5  63.3  74.8  84.6  91.2  95.0  65.2  75.8  83.7  89.4  93.6  96.5  
Triplet loss + HORDE  53.6  65.0  76.0  85.2  91.1  95.3  74.0  82.9  89.4  93.7  96.4  98.0  
Binomial Deviance (Ours)  55.9  67.6  78.3  86.4  92.3  96.1  78.2  86.0  91.3  94.6  97.1  98.3  
Binomial Deviance + HORDE  58.3  70.4  80.2  87.7  92.9  96.3  81.5  88.5  92.7  95.4  97.4  98.6  
Binomial Deviance + HORDE  59.4  71.0  81.0  88.0  93.1  96.5  83.2  89.6  93.6  96.3  98.0  98.8  
BN-Inception  Multi-similarity loss [30]  65.7  77.0  86.3  91.2  95.0  97.3  84.1  90.4  94.0  96.5  98.0  98.9 
contrastive loss + HORDE  66.3  76.7  84.7  90.6  94.5  96.7  83.9  90.3  94.1  96.3  98.3  99.2  
contrastive loss + HORDE  66.8  77.4  85.1  91.0  94.8  97.3  86.2  91.9  95.1  97.2  98.5  99.4  
Ensemble Methods  
GoogleNet  HDC [35]  53.6  65.7  77.0  85.6  91.5  95.5  73.7  83.2  89.5  93.8  96.7  98.4 
BIER [19]  55.3  67.2  76.9  85.1  91.7  95.5  78.0  85.8  91.1  95.1  97.3  98.7  
A-BIER [20]  57.5  68.7  78.3  86.2  91.9  95.5  82.0  89.0  93.2  96.1  97.8  98.7  
ABE [11]  60.6  71.5  79.8  87.4      85.2  90.5  94.0  96.1      
ABE (Ours)  60.0  71.8  81.4  88.9  93.4  96.6  79.2  87.1  92.0  95.2  97.3  98.7  
ABE + HORDE  62.7  74.3  83.4  90.2  94.6  96.9  86.4  92.0  95.3  97.4  98.6  99.3  
ABE + HORDE  63.9  75.7  84.4  91.2  95.3  97.6  88.0  93.2  96.0  97.9  99.0  99.5 
Table 2: Comparison with the state-of-the-art on Stanford Online Products and In-Shop Clothes Retrieval (Recall@K, %).

Stanford Online Products  In-Shop Clothes Retrieval
Backbone  R@  1  10  100  1000  1  10  20  30  40  50 
GoogleNet  Angular loss [29]  70.9  85.0  93.5  98.0             
HDML [36]  68.7  83.2  92.4                
DAML-RMM [32]  69.7  85.2  93.2                
DVML [13]  70.2  85.2  93.8                
HTL [5]  74.8  88.3  94.8  98.4  80.9  94.3  95.8  97.2  97.4  97.8  
Binomial Deviance (Ours)  67.4  81.7  90.2  95.4  81.3  94.2  95.9  96.7  97.2  97.6  
Binomial Deviance + HORDE  72.6  85.9  93.7  97.9  84.4  95.4  96.8  97.4  97.8  98.1  
BN-Inception  Multi-similarity loss [30]  78.2  90.5  96.0  98.7  89.7  97.9  98.5  98.8  99.1  99.2 
contrastive loss + HORDE  80.1  91.3  96.2  98.7  90.4  97.8  98.4  98.7  98.9  99.0 
This result implies that regularizing highorder moments to be similar enforces similar images to have deep features sampled from similar distributions. Thus, deep features from similar images have a higher probability of being concentrated in the same regions of the feature space.
Next, we show a converse relation between highorder moments and the Wasserstein distance:
Theorem 2.
There exists such that, for every distributions , the squared Wasserstein distance is bounded from below by the first moments of and by:
(18) 
Proof.
Similarly to the Theorem 1, we can lowerbound the Gaussian kernel using its Taylor expansion:
where and . Then, by using the definition of from Equation 16, a lowerbound for the MMD is:
(19) 
where . Finally, the MMD is a lowerbound of the Wasserstein distance [24]:
(20) 
By combining subsection 3.2 and Equation 20, we get the expected lowerbound:
(21) 
where . ∎
Hence, regularizing highorder moments to be dissimilar enforces dissimilar images to have deep features sampled from different distributions. As such, deep features are more distinctive as they are sampled from different regions of the feature space for dissimilar images. This is illustrated in Figure c () compared to Figure a ().
Table 3: Ablation on CUB with untrained Random Maclaurin projections (Recall@K, %).

k  1  2  3  4  5  6

n  1  1  2  1  2  3  1  2  3  4  1  2  3  4  5  1  2  3  4  5  6  
R@1  55.9  57.8  58.6  56.8  58.0  56.9  57.8  58.8  57.6  56.1  57.4  57.7  56.8  56.3  53.3  57.4  57.9  57.1  55.6  54.4  50.7 
R@2  67.6  69.5  70.4  68.1  69.4  68.7  69.2  70.6  70.0  68.5  68.8  69.9  69.3  68.1  65.4  69.9  70.6  70.5  68.9  66.2  63.0 
R@4  78.3  79.0  79.8  78.3  78.8  78.1  78.6  79.9  79.2  78.1  78.7  78.8  79.2  78.0  75.9  79.4  80.0  79.9  78.7  76.5  74.0 
R@8  86.4  86.7  87.2  86.2  86.7  86.6  86.5  87.2  87.0  85.5  87.0  87.1  87.1  86.5  84.2  86.9  87.4  87.4  86.7  85.4  82.5 
Table 4: Ablation on CUB with learned projection matrices, without the cascade (Recall@K, %).

k  1  2  3  4  5  6  

n  1  1  2  1  2  3  1  2  3  4  1  2  3  4  5  1  2  3  4  5  6 
R@1  55.9  57.0  53.4  57.6  54.7  50.6  57.9  55.4  52.3  47.6  58.1  55.9  53.1  48.4  43.7  58.4  55.7  52.9  47.8  43.9  40.5 
R@2  67.6  68.3  65.4  69.9  67.0  63.0  69.5  67.1  65.0  60.2  70.3  67.7  65.0  60.8  56.0  69.9  67.6  64.9  59.9  56.0  53.0 
R@4  78.3  78.3  75.8  79.1  76.8  73.6  79.6  77.5  75.2  71.0  79.9  78.2  75.5  72.8  67.2  79.8  78.0  75.6  70.2  67.2  64.7 
R@8  86.4  86.2  84.2  87.0  84.7  82.4  87.1  85.8  83.6  80.2  87.1  85.2  83.9  81.7  78.0  87.3  85.6  83.8  79.6  77.5  75.2 
Table 5: Ablation on CUB with the proposed cascaded architecture (Recall@K, %).

k  1  2  3  4  5  6  

n  1  1  2  1  2  3  1  2  3  4  1  2  3  4  5  1  2  3  4  5  6 
R@1  55.9  57.0  53.4  57.9  56.1  54.2  57.6  55.4  54.3  53.0  58.3  56.3  56.0  54.7  52.4  57.9  56.6  55.8  55.0  53.9  51.6 
R@2  67.6  68.3  65.4  69.4  67.9  66.2  69.3  67.2  66.0  65.2  70.4  68.7  68.1  66.9  64.7  69.5  68.8  68.3  67.7  65.2  64.0 
R@4  78.3  78.3  75.8  79.2  77.8  76.4  79.5  77.2  77.0  75.8  80.2  78.5  78.3  76.9  75.6  79.6  76.6  77.9  77.9  75.3  74.4 
R@8  86.4  86.2  84.2  86.6  85.3  84.4  87.1  85.6  84.4  84.1  87.7  86.3  86.0  85.4  84.1  87.0  86.4  85.6  84.8  84.0  83.7 
4 Comparison to the state-of-the-art
We present the benefits of our method by comparing our results with the state-of-the-art on four datasets, namely CUB-200-2011 (CUB) [27], Cars-196 (CARS) [12], Stanford Online Products (SOP) [18] and In-Shop Clothes Retrieval (INSHOP) [15]. We report the Recall@K (R@K) on the standard DML splits associated with these datasets. Following standard practices, we use GoogleNet [25] as a backbone network and we add a fully connected layer at the end for the embedding. For CUB and CARS, we train HORDE using 5 high-order moments with 5 classes and 8 images per class per batch. For SOP and INSHOP, we use 4 high-order moments with a batch of 2 images per class and 40 different classes, as these datasets contain classes with only 2 images. We use crops and the following data augmentation at training time: multi-resolution, where the resolution is uniformly sampled relative to the crop size, random cropping and horizontal flipping. At inference time, we only use the resized images. For HORDE, we use 8192 dimensions for all high-order moments and we fix all embedding dimensions to 512. Finally, we take advantage of the high-order moments at testing time by concatenating them together. To be fair to other methods, we reduce their dimensionality to 512 using a PCA; these results are annotated accordingly in the tables.
First, we show in the upper part of Table 1 that HORDE significantly improves three popular baselines (contrastive loss, triplet loss and binomial deviance). These improvements allow us to claim state-of-the-art results for single model methods on CUB with 58.3% R@1 (compared to 57.1% R@1 for HTL [5]) and second best for CARS.
We also present ensemble method results in the second part of Table 1. We show that HORDE also benefits ensemble methods, improving ABE [11] by 2.7% R@1 on CUB and 7.2% R@1 on CARS. To the best of our knowledge, this allows us to outperform the state-of-the-art methods on both datasets with 62.7% R@1 on CUB and 86.4% R@1 on CARS, despite our implementation of ABE under-performing compared to the results reported in [11].
Note that both single models and ensemble ones are further improved by using the highorder moments at testing: +1.1% on CUB and +1.7% on CARS for the single models + HORDE and +1.2% on CUB and +1.6% on CARS for ABE + HORDE.
Furthermore, we show that HORDE generalizes well to large scale datasets by reporting results on SOP and INSHOP in Table 2. HORDE improves our binomial deviance baseline by 5.2% R@1 on SOP and 3.1% R@1 on INSHOP. This improvement allows us to claim state-of-the-art results for single model methods on INSHOP with 84.4% R@1 (compared to 80.9% R@1 for HTL) and second best on SOP with 72.6% R@1 (compared to 74.8% R@1 for HTL). Remark also that HORDE outperforms HTL on 3 out of 4 datasets.
We also report some results with the BN-Inception backbone [8]. Our model trained with HORDE and the contrastive loss leads to results similar to the recent MS loss with mining [30] on the smaller datasets, while on the larger datasets we outperform it by 1.9% on SOP and by 0.7% on INSHOP. By using the high-order moments at testing, performance is further increased and outperforms the MS loss with mining by 1.1% on CUB and by 2.1% on CARS.
Finally, we show some example queries and their nearest neighbors in Figure 8 on the test split of CUB.
5 Ablation study
In this section, we provide an ablation study on the different contributions of this paper. We perform 3 experiments on the CUB dataset [27]. The first experiment shows the impact of high-order regularization on a standard architecture when the high-order moments are consistently approximated using the Random Maclaurin approximation. The second experiment illustrates the benefit of learning the high-order moments projection matrices. The last experiment confirms the statistical consistency of our cascaded architecture when the parameters are learned.
5.1 Regularization effect
In this section, we assess the regularization impact of HORDE. To that aim, we use the baseline detailed in section 4 and we train the architecture with a number of high-order moments varying from 2 to 6. In this first experiment, the computation of the high-order moments does not rely on the cascade computation of Equation 6. Instead, the matrices that approximate the high-order moments are kept untrained and are sampled using the Random Maclaurin method of Equation 4. Remark also that no embedding layers are added on the high-order moments. We use the binomial deviance loss with the standard parameters [26]. The results are shown in Table 3.
First, we can see that HORDE consistently improves the baseline by 1% to 2% in R@1. These results corroborate the insights of our theoretical analysis in section 3 and also provide a quantitative evaluation, in terms of retrieval ranking, of the behavior observed in Figure 4. When considering the high-order moments as representations, we observe improved results with respect to the baseline for orders 2 and 3. Note however that the reported high-order results are not directly comparable to the first order, as their similarity measure is computed on the 8192-dimensional representations. While adding orders higher than 2 does not seem interesting in terms of raw performance, we found that the training process is more stable with 5 or 6 orders than with only 2. This is observed in practice by measuring the Recall@K, which tends to vary less between training steps. Moreover, on the CUB dataset, while the baseline requires around 6k steps to reach its best results, we usually need 1k steps fewer to reach a higher accuracy with HORDE.
5.2 Statistical consistency
To evaluate the impact of estimating only informative high-order moments, we first train the projection matrices and the embeddings, but without the cascade architecture, and report the results in Table 4.

In this second experiment, we empirically show that such a scheme also increases the baseline by at least 1% in R@1. Notably, by focusing on the most informative high-order moment components, trained HORDE further improves the performance of the untrained HORDE from 57.8% to 58.4%. However, the retrieval performances of the high-order representations are heavily degraded compared to Table 3. We interpret these results as inconsistent estimations of the high-order moments due to overfitting. For example, the 6% loss in R@1 for the third-order moment between the first and the second experiments suggests a reduced interest for even higher-order moments.
For the third experiment, we report the results of our cascaded architecture in Table 5. Interestingly, the high-order moments computed from the cascaded architecture perform almost identically to those computed from the untrained method of Table 3, but with a smaller dimension. Moreover, we keep the performance improvement of the second experiment of Table 4. This confirms that the proposed cascaded architecture does not overfit its estimations of the high-order moments while still improving the baseline. Finally, this cascaded architecture only incurs a small computational overhead during training compared to the architecture without the cascade.
6 Conclusion
In this paper, we have presented HORDE, a new deep metric learning regularization scheme which improves the distinctiveness of the deep features. This regularizer, based on the optimization of the distance between the distributions of the deep features, provides consistent improvements to a wide variety of popular deep metric learning methods. We give theoretical insights that show HORDE upper-bounds the Maximum Mean Discrepancy and lower-bounds the Wasserstein distance. The computation of high-order moments is tackled using a trainable Random Maclaurin factorization scheme, which is exploited to produce a cascaded architecture with a small computational overhead. Finally, HORDE achieves very competitive performance on four well-known datasets.
Acknowledgements
The authors would like to acknowledge the COMUE Paris Seine University, the Cergy-Pontoise University and M2M Factory for their financial and technical support.
References
[1] (2018) Cross-modal retrieval in the cooking context: learning semantic text-image embeddings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval.
[2] (2017) Beyond triplet loss: a deep quadruplet network for person re-identification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[3] (2005) Learning a similarity metric discriminatively, with application to face verification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[4] (2018) Deep adversarial metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[5] (2018) Deep metric learning with hierarchical triplet loss. In The European Conference on Computer Vision (ECCV).
[6] (2007) A kernel method for the two-sample problem. In Advances in Neural Information Processing Systems, pp. 513–520.
[7] (2017) Smart mining for deep metric learning. In The IEEE International Conference on Computer Vision (ICCV).
[8] (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning.
[9] (2012) Negative evidences and co-occurrences in image retrieval: the benefit of PCA and whitening. In The European Conference on Computer Vision (ECCV).
[10] (2012) Random feature maps for dot product kernels. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics.
[11] (2018) Attention-based ensemble for deep metric learning. In The European Conference on Computer Vision (ECCV).
[12] (2013) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13).
[13] (2018) Deep variational metric learning. In The European Conference on Computer Vision (ECCV).
[14] (2016) Deep relative distance learning: tell the difference between similar vehicles. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[15] (2016) DeepFashion: powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[16] (2017) No fuss distance metric learning using proxies. In The IEEE International Conference on Computer Vision (ICCV).
[17] (2017) Deep metric learning via facility location. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[18] (2016) Deep metric learning via lifted structured feature embedding. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[19] (2017) BIER - boosting independent embeddings robustly. In The IEEE International Conference on Computer Vision (ICCV).
[20] (2018) Deep metric learning with BIER: boosting independent embeddings robustly. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[21] (2016) Metric learning with adaptive density discrimination. In International Conference on Learning Representations (ICLR).
[22] (2015) FaceNet: a unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[23] (2016) Improved deep metric learning with multi-class N-pair loss objective. In Advances in Neural Information Processing Systems 29.
[24] (2010) Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research.
[25] (2015) Going deeper with convolutions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[26] (2016) Learning deep embeddings with histogram loss. In Advances in Neural Information Processing Systems 29.
[27] (2011) The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology.
[28] (2018) CosFace: large margin cosine loss for deep face recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[29] (2017) Deep metric learning with angular loss. In The IEEE International Conference on Computer Vision (ICCV).
[30] (2019) Multi-similarity loss with general pair weighting for deep metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[31] (2018) Bidirectional retrieval made simple. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[32] (2019) Deep asymmetric metric learning via rich relationship mining. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4076–4085.
[33] (2018) Deep randomized ensembles for metric learning. In The European Conference on Computer Vision (ECCV).
[34] (2018) Correcting the triplet selection bias for triplet loss. In The European Conference on Computer Vision (ECCV).
[35] (2017) Hard-aware deeply cascaded embedding. In The IEEE International Conference on Computer Vision (ICCV).
[36] (2019) Hardness-aware deep metric learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 72–81.
[37] (2016) Learning deep features for discriminative localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[38] (2017) Efficient online local metric adaptation via negative samples for person re-identification. In The IEEE International Conference on Computer Vision (ICCV).