CLIP Latent Space — Conformity & Likelihood (ViT-L/14)
This Space operates on CLIP ViT-L/14 latent space to compute two metrics per modality:
- Conformity — measures how typical the sample is (based on "The Double-Ellipsoid Geometry of CLIP")
- Log-Likelihood — measures how likely the sample is (based on "Whitened CLIP as a Likelihood Surrogate of Images and Captions")
All modality means and W matrices are stored internally and loaded from `w_mats/*.pt`.
Data provenance
Modality means and precision matrices (W) are computed from MS-COCO features. They are loaded from precomputed `.pt` files in the Space repo.
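A minimal sketch of how such precomputed statistics could be loaded; the file layout and key names (`mu`, `W`) are assumptions, not confirmed by the Space:

```python
import torch

def load_stats(path: str):
    """Load a modality's mean vector and precision matrix from a .pt file.

    Assumed (hypothetical) layout: a dict with 'mu' of shape (D,) and
    'W' of shape (D, D), as saved by torch.save.
    """
    stats = torch.load(path, map_location="cpu")
    mu, W = stats["mu"], stats["W"]
    # Sanity-check shapes: W must be a square matrix matching mu's dimension.
    assert mu.ndim == 1 and W.shape == (mu.shape[0], mu.shape[0])
    return mu, W
```

For CLIP ViT-L/14, D would be 768 (the joint embedding dimension).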
Implementation details:
- Embeddings: `openai/clip-vit-large-patch14` via 🤗 Transformers; features are L2-normalized.
- Conformity: cosine similarity to the stored modality means `mu_image`, `mu_text`.
- Log-likelihood: `-0.5 * (x - mu)^T W (x - mu)` using the MS-COCO-based precision `W`.
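The two metrics above can be sketched as follows. This is a minimal illustration of the math only: the CLIP feature extraction step (via 🤗 Transformers) is omitted, and `x` is assumed to be an already L2-normalized feature vector:

```python
import torch
import torch.nn.functional as F

def conformity(x: torch.Tensor, mu: torch.Tensor) -> torch.Tensor:
    """Conformity: cosine similarity between a feature x and the modality mean mu."""
    return F.cosine_similarity(x, mu, dim=-1)

def log_likelihood(x: torch.Tensor, mu: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """Gaussian log-likelihood surrogate: -0.5 * (x - mu)^T W (x - mu)."""
    d = x - mu
    return -0.5 * (d @ W @ d)
```

With `x == mu` the conformity is 1 (after normalization) and the log-likelihood term is 0; less typical samples score lower on both.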