CLIP Latent Space — Conformity & Likelihood (ViT-L/14)

This Space operates in the CLIP ViT-L/14 latent space and computes two metrics per modality:

  1. Conformity — measures how typical (common) the sample is (based on The Double-Ellipsoid Geometry of CLIP)
  2. Log-Likelihood — measures how likely the sample is (based on Whitened CLIP as a Likelihood Surrogate of Images and Captions)

All modality means and W matrices are stored internally and loaded from w_mats/*.pt.

Data provenance
Modality means and precision matrices (W) are computed from MS-COCO features.
They are loaded from precomputed .pt files in the Space repo.


Implementation details:

  • Embeddings: openai/clip-vit-large-patch14 via 🤗 Transformers; features are L2-normalized.
  • Conformity: cosine similarity to stored modality means mu_image, mu_text.
  • Log-likelihood: -0.5 * (x-mu)^T W (x-mu) using MS-COCO-based precision W.
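
The two metrics above can be sketched in a few lines of NumPy. This is a minimal illustration, not the Space's actual code: the toy `mu_image` and `W` below are random/identity stand-ins for the MS-COCO statistics that the Space loads from w_mats/*.pt, and 768 is ViT-L/14's projected embedding dimension.

```python
import numpy as np

def conformity(x, mu):
    # Cosine similarity between the embedding and the stored modality mean.
    x = x / np.linalg.norm(x)
    mu = mu / np.linalg.norm(mu)
    return float(x @ mu)

def log_likelihood(x, mu, W):
    # -0.5 * (x - mu)^T W (x - mu): unnormalized Gaussian log-density
    # with precision matrix W.
    d = x - mu
    return -0.5 * float(d @ W @ d)

# Toy stand-ins for the stored statistics (the real ones come from w_mats/*.pt).
rng = np.random.default_rng(0)
dim = 768                       # CLIP ViT-L/14 embedding dimension
mu_image = rng.normal(size=dim) # placeholder modality mean
W = np.eye(dim)                 # placeholder precision matrix
x = rng.normal(size=dim)        # placeholder CLIP feature

print(conformity(x, mu_image), log_likelihood(x, mu_image, W))
```

With a positive semi-definite W (as a precision matrix is), the log-likelihood is always ≤ 0 and is maximized (at 0) when the embedding coincides with the modality mean.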