Sapiens by Meta AI: Foundation for Human Vision Models

Author: AI Papers Academy
Published At: 2024-08-23T00:00:00
Length: 04:33

Description

In this video we dive into Sapiens, a new family of models for four fundamental human-centric vision tasks (2D pose estimation, body-part segmentation, depth estimation, and surface normal prediction), presented by Meta AI in a recent research paper titled "Sapiens: Foundation for Human Vision Models".

The model's architecture is based on the Vision Transformer (ViT), which is pushed for the first time to pretrain natively on 1K-resolution images, roughly 5x larger than DINOv2's input image size!
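To get a feel for why 1K-resolution training is expensive, here is a quick sketch of ViT token-count arithmetic. The resolutions and the 16x16 patch size are illustrative assumptions, not exact values from the paper:

```python
# Token-count arithmetic for a ViT: a square image is split into
# fixed-size patches, and each patch becomes one transformer token.
# Sizes below are assumptions for illustration only.
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of tokens a ViT produces for a square image."""
    return (image_size // patch_size) ** 2

tokens_1k = num_patches(1024, 16)   # 1K-resolution input -> 4096 tokens
tokens_224 = num_patches(224, 16)   # typical 224px input -> 196 tokens
print(tokens_1k, tokens_224)
```

Since self-attention cost grows quadratically with token count, the jump from ~200 to ~4000 tokens makes high-resolution pretraining far more compute-hungry.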

We cover the model's training process, which starts with a self-supervised pretraining step based on the masked-autoencoder (MAE) approach, which we also explain in the video.
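The core of MAE pretraining is simple: hide most of the image patches and train the model to reconstruct them. A minimal sketch of the masking step, with an illustrative 75% mask ratio (the exact ratio used by Sapiens is not stated here):

```python
import random

def mae_mask(num_patches: int, mask_ratio: float = 0.75, seed: int = 0):
    """MAE-style masking: randomly pick a small visible subset of
    patch indices; the rest are masked and must be reconstructed.
    The 75% default mask ratio is an illustrative assumption."""
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_visible = int(num_patches * (1 - mask_ratio))
    return indices[:num_visible], indices[num_visible:]

visible, masked = mae_mask(256)  # e.g. 256 patches -> 64 visible, 192 masked
print(len(visible), len(masked))
```

Only the visible patches are fed through the encoder, which is what makes MAE pretraining efficient even at high resolution.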

For a quick background about Vision Transformers, watch the following short video - Introduction to Vision Transformers (ViT) | An Image is Worth 16x16 Words

Code - https://github.com/facebookresearch/sapiens

Paper page - https://arxiv.org/abs/2408.12569

-----------------------------------------------------------------------------------------------

✉️ Join the newsletter - https://aipapersacademy.com/newsletter/

Become a patron - https://www.patreon.com/aipapersacademy

👍 Please like & subscribe if you enjoy this content

-----------------------------------------------------------------------------------------------

Chapters:

0:00 Introduction

1:05 Humans-300M Dataset

1:54 Self-Supervised Pretraining

3:54 Task-specific Models
