Sapiens by Meta AI: Foundation for Human Vision Models
Description
In this video, we dive into Sapiens, a new family of models for four fundamental human-centric tasks (2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction), presented by Meta AI in a recent research paper titled "Sapiens: Foundation for Human Vision Models".
The models are based on the Vision Transformer (ViT) architecture, pushed for the first time to train natively on 1K-resolution images, roughly 5x larger than DINOv2's input size!
We walk through the models' training process, including a self-supervised pretraining step based on the masked autoencoder (MAE) approach, which we also explain in the video.
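The core idea of MAE pretraining is to split the image into patches, hide most of them, feed only the visible patches to the encoder, and train a decoder to reconstruct the hidden ones. The sketch below illustrates just the input-preparation step (patchify + random masking) in plain NumPy; the patch size, mask ratio, and function names are illustrative assumptions, not the exact Sapiens implementation.

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into non-overlapping flattened patches."""
    H, W, C = image.shape
    p = image.reshape(H // patch, patch, W // patch, patch, C)
    # Reorder so each row of the result is one patch's flattened pixels.
    return p.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)

def random_mask(patches, mask_ratio=0.75, rng=None):
    """Keep a random subset of patches; in MAE, only these visible patches
    go through the encoder, and the masked ones are reconstruction targets."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return patches[keep_idx], keep_idx

# A dummy 1K-resolution input, matching the resolution Sapiens pretrains at.
image = np.zeros((1024, 1024, 3), dtype=np.float32)
patches = patchify(image)                 # (4096, 768): 64x64 patches of 16x16x3
visible, keep_idx = random_mask(patches)  # (1024, 768): 25% of patches kept
print(patches.shape, visible.shape)
```

At 1024x1024 with 16-pixel patches this yields 4096 patches, of which only a quarter are encoded, which is what makes MAE pretraining cheap even at high resolution.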
For quick background on Vision Transformers, watch this short video: Introduction to Vision Transformers (ViT) | An Image is Worth 16x16 Words
Code - https://github.com/facebookresearch/sapiens
Paper page - https://arxiv.org/abs/2408.12569
-----------------------------------------------------------------------------------------------
✉️ Join the newsletter - https://aipapersacademy.com/newsletter/
Become a patron - https://www.patreon.com/aipapersacademy
👍 Please like & subscribe if you enjoy this content
-----------------------------------------------------------------------------------------------
Chapters:
0:00 Introduction
1:05 Humans-300M Dataset
1:54 Self-Supervised Pretraining
3:54 Task-specific Models