CV4Animals 2023: The state of the art in quantifying animal movement and behavior

A writeup of the CV4Animals 2023 workshop

Lili Karashchuk


June 19, 2023

There has been a lot of work in the past decade on quantifying animal movement and behavior. For the past 3 years, computer vision researchers have gathered to review progress in this endeavor: the CV4Animals workshop.

I was excited to attend the CV4Animals 2021 workshop two years ago and I’m excited once again to attend CV4Animals 2023, this time in person for the first time.

It is much nicer to talk to people in person at posters and to catch up with speakers after their talks. However, walking back and forth across the conference center several times to catch the talks and then the posters was quite taxing. I was impressed that so many people stayed for the workshop despite this journey. Such is the dedication of animal vision scientists!

The room was completely full for the whole duration of the workshop, to the extent that the organizers ended up hosting an overflow room!

The workshop itself felt bigger, with more diverse computer vision problems and animals than in 2021. While drafting this post, I realized I was covering many more works and themes than in my 2021 post. I wonder if I will have to cut down more dramatically for CV4Animals 2025.

So anyway, here are the themes I saw in the featured talks and in the poster session.

Vision in the wild

(Above) Workflow for tracking animals using a deep sea autonomous vehicle.
(Below) Examples of some images tracked in the wild. Both from Katija et al, 2021.

By far the biggest emphasis in the main sequence of talks was applying computer vision to images or videos of animals in the wild. Three of the keynote talks fell under this theme. Tanya Berger-Wolf (and her student Mohannad Elhamod) talked about a network that maps images to a latent space which can be matched to known biological structures such as phylogeny.

Kakani Katija gave a memorable talk about the videos the Monterey Bay Aquarium Research Institute (MBARI) is collecting in the deep sea, plans for collecting more videos at large scale with autonomous underwater vehicles, and new initiatives for annotating all of this data. Along the same lines, Devis Tuia talked about mapping coral reefs using a GoPro. They used photogrammetry along with some segmentation to remove fish, humans, and other things that are not coral reefs.

There were a few posters on this topic as well:

These wild settings are really at the frontier of computer vision. “Foundation” models typically don’t work here, as the animals may be in odd poses, the footage may be lower quality (due to the constraints of recording in the field), and such images are often absent from web datasets. Collecting in-depth annotated data is also time-consuming, or sometimes impossible, in these settings. The proposed ways to tackle these issues are to (1) use known structures from biology, (2) rely on large-scale human annotation, and (3) teach people how to collect better data at scale. These approaches make sense with the current technology.

I wonder if we can also make unsupervised approaches or synthetic data good and easy enough to make it a more standard approach in these novel environments. There may be ways to fine-tune the large models for these environments while still taking advantage of general vision priors.

Unsupervised approaches

Hi-LASSIE, a method to recover a 3D articulated skeleton from a dozen images of an animal. The key idea is to cluster based on DINO features and train a neural network to represent the surface. It doesn’t require any annotations and, I think, showcases a possible future of 3D shape estimation without any keypoint annotations. Paper is by Yao et al, 2023.

So with that, it was interesting to learn about new approaches to studying animal pose and behavior without any annotations. I must admit that, to me, this is the most exciting area of computer vision for animals, as the annotation process is still way too laborious. As in 2021, this is still an under-represented area, but I hope to see it grow.

There were no talks on unsupervised or synthetic data approaches, but here are some posters I saw at CV4Animals and CVPR:

The Segment Anything Model takes a point in an image and returns a set of possible masks around that point. Each column in the image shows 3 valid masks generated by SAM from a single ambiguous point prompt (green circle). Note how it can pick out different parts of the ostrich (full ostrich, upper body, and head).

Some of this work is driven by new big models that came out in the past year. From N to N+1 (above) uses the Segment Anything Model (SAM), which came out in April 2023. SAM takes an image and a point in that image and produces a segmentation mask. SAM works well for natural scenes, but may need to be fine-tuned for harder images from experimental settings. I do think SAM will help a lot with removing the backgrounds that make tracking in natural scenes challenging.
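SAM itself needs large pretrained weights, but the core interaction it offers can be illustrated with a toy: click one point, get back a mask of the coherent region around it. The flood-fill sketch below is purely a hypothetical stand-in for that interface (SAM's actual model is far more sophisticated and can return multiple candidate masks per point):

```python
from collections import deque

def point_to_mask(image, seed, tol=10):
    """Toy point-prompt segmentation: flood-fill the region of similar
    pixel values around a clicked point. This is NOT how SAM works
    internally; it only mimics the interface of one point in -> one
    binary mask out."""
    h, w = len(image), len(image[0])
    sy, sx = seed
    target = image[sy][sx]
    mask = [[False] * w for _ in range(h)]
    mask[sy][sx] = True
    queue = deque([seed])
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not mask[ny][nx] \
                    and abs(image[ny][nx] - target) <= tol:
                mask[ny][nx] = True
                queue.append((ny, nx))
    return mask

# A tiny grayscale "image" with a bright blob in the top-left corner:
img = [
    [200, 200,  10,  10],
    [200, 200,  10,  10],
    [ 10,  10,  10,  10],
    [ 10,  10,  10, 200],
]
blob = point_to_mask(img, (0, 0))
print(sum(v for row in blob for v in row))  # 4 pixels in the clicked blob
```

The disconnected bright pixel at the bottom right is correctly excluded, which is the same property that makes point prompts useful for isolating one animal against a cluttered background.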

The DINOv2 network has learned to compute features for image patches. Here, the features are reduced using PCA to 3 dimensions and shown as 3 colors (input image on the left, computed features on the right). It seems to have learned some representation of horse body parts, across a variety of different horse images.

Hi-LASSIE and MagicPony build upon DINO features that came out in April 2021. As of April 2023, we now have DINOv2 features, which are even more robust and seem to work quite well on animals in natural scenes. I’m excited to see where the animal vision community will take these in the next few years.
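The PCA visualization in the figure above is easy to reproduce: take the (num_patches × feature_dim) matrix of patch features, project it onto its top 3 principal components, and rescale each component to [0, 1] so it can be displayed as an RGB image. Here is a minimal numpy sketch, using random stand-in features rather than real DINOv2 outputs:

```python
import numpy as np

def features_to_rgb(features, grid_h, grid_w):
    """Project patch features onto their top 3 principal components and
    rescale to [0, 1] so they can be shown as an RGB image.
    `features` is (num_patches, feature_dim), row-major over the patch grid."""
    centered = features - features.mean(axis=0)
    # SVD of the centered matrix gives the principal directions in vt.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T                    # (num_patches, 3)
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / (hi - lo + 1e-8)          # per-channel rescale to [0, 1]
    return rgb.reshape(grid_h, grid_w, 3)

# Stand-in for ViT patch features: a 16x16 grid of 384-dim vectors.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16 * 16, 384))
viz = features_to_rgb(feats, 16, 16)
print(viz.shape)  # (16, 16, 3)
```

With real DINOv2 features, semantically similar patches (say, all the horse legs) end up with similar projections and hence similar colors, which is exactly what the figure shows.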

The paper Tracking Everything Everywhere All at Once came out in June 2023. It proposes a new technique where the user specifies points in an initial image and the method then tracks them through a video. The Improving Unsupervised Label Propagation poster works in a similar setting, but the “Tracking Everything” method seems much more robust. I think this will make it much easier to do ad-hoc pose estimation in short videos and further reduce the need for annotation.

Synthetic data

All of these images are synthetically generated! These are made by the Infinigen procedural generator.

I didn’t see any papers on synthetic data at CV4Animals 2023. Did I miss some? Their absence seemed glaring, as these were quite popular in 2021. At the panel, someone asked what people think is the role of synthetic data going forward. The panelists seemed skeptical that it could work in the general case.

Personally, I think it could still be viable, but there needs to be an easy-to-use interface for scientists to simulate their animal of interest. Infinigen (also presented at CVPR 2023) seems like a solid step in that direction. Overall, I’m not sure whether the simulation problem is easier or harder than simply reducing the annotation load through unsupervised and semi-supervised approaches. We’ll have to see which approach ends up being easier for scientists to apply in the next few years. Perhaps it will be some combination of these on a case-by-case basis.

New datasets

The 3D-POP dataset, which provides ground-truth data (from markers) of multiple pigeons in 3D. As far as I know, it is the only such dataset with more than 2 animals, and really the only multi-animal 3D dataset besides PAIR-R24M.

There were a bunch of new datasets at this session. In her keynote talk, Kakani Katija talked a bit about FathomNet, a crowdsourced image database of ocean creatures. There was a brief oral presentation about PanAf20K, a video dataset of ape behavior in the wild (with bounding box and behavior annotations).

Besides the above, these posters presented a variety of different datasets.

Behavior datasets:

Pose estimation datasets:



An overview of MammalNet, 1 of 3 huge behavior datasets presented at CV4Animals. MammalNet has 18 thousand videos of 173 different mammals classified into 12 different behaviors.

In 2021 I wrote “There aren’t the same level of datasets with animal video behavior annotations as there are for animal keypoints.” Well, it’s 2023 and that has changed. We have so many animal behavior videos! The scale of MammalNet (18k videos), PanAf20K (20k videos), and Animal Kingdom (33k videos) boggles the mind. I wonder if this will usher in a bigger interest in animal behavioral analysis within the computer vision community. Additionally, with unsupervised pose estimation approaches becoming usable (see above), perhaps these datasets could even be a source for understanding the fine kinematics of diverse behaviors.

Multi-animal settings and animal identification

A pipeline to identify individual cattle from images, by Ramesh et al

Kostas Daniilidis started the session off with a 3D multi-animal tracking problem: cowbirds in an aviary. According to him, the multi-animal problem here is not adequately solved, but at least they did develop some nice benchmarks.

Again, here are some relevant posters:

It’s interesting to see 4 different re-identification posters, with 2 on cattle. Each method seems so different too? I don’t know what to make of the re-identification literature. It seems like a common framework still needs to emerge here.

Failure cases of multi-animal tracking using centroids, from Xiao et al, 2022. (a) Inseparable point cloud due to occlusions. (b) Merged/split clusters due to the shape change of an individual at different instants of time, which can result in ghost trajectories. (c) Identity switch: at first, the blue hypothesis correctly tracks the ground-truth blue bird; after a few frames, though, the blue bird and the red bird cross paths and the blue hypothesis follows the wrong target. (d) Ghost trajectory resulting from false positive detections, e.g. the shadow of a bird.

I can’t tell to what extent multi-animal tracking is solved. At least in 2D, we now have nice frameworks such as DeepLabCut and SLEAP, but no framework is available for multi-animal 3D yet as far as I know.4 In 3D, as far as I could tell, nobody has adequately solved the problem: both Kostas Daniilidis and the authors of 3D-MuPPET readily admitted that the multi-animal aspect still needs work. Perhaps we’ll see some new solutions at CVPR 2024.


Behavior

Studying animal behavior is a huge endeavor and computer vision could really help. I saw a lot more ethologists at this workshop compared to 2 years ago. I wonder if it has to do with one of the organizers (Sara Beery) working closely with ethologists.

There were also a bunch of behavior datasets, which I reviewed above.

There were a lot of behavior papers, so I further subdivided them into multiple categories here.

Behavioral embeddings

(a) We may think of the behavior as a sequence of states. Kwon, Kim et al ask “How can we measure whether the behavioral embedding is good?”
(b) They take the perspective that better embeddings place behaviors that are closer in time also closer in the embedding space.
(c) To quantify this, they propose the “Temporal Proximity Index”, which is higher when temporally-adjacent behaviors are embedded closer together.

We don’t really understand behavior, and I often find human-specified behavioral categories reductive. For instance, there are many types of grooming, and each may occur more or less often in different environments or following different actions. Still, these are often subsumed into a single category (“grooming”) for simplicity. It is hard to find a good level of behavioral subdivision, and maybe it doesn’t even matter for the purpose of a given study. Nevertheless, the problem of discovering behaviors without supervision remains.

So… what if we could somehow get some embeddings on behavior without human input on what the behaviors actually are? These posters try a few different ways to obtain these embeddings:

I particularly liked the SUBTLE poster, mostly because they designed a metric to evaluate behavioral embeddings based on UMAP and tested a few different hyperparameters to see what worked best. These kinds of embeddings have been pretty common in the literature (e.g. the embeddings in the Anipose and DANNCE papers are based on UMAP as well), so this evaluation felt quite relevant.
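I don't know the exact formula behind the Temporal Proximity Index, but the flavor of such a metric can be sketched: compare how close temporally adjacent frames sit in the embedding to how close random frame pairs sit. This is an illustrative stand-in of my own, not the poster's actual definition:

```python
import numpy as np

def temporal_proximity_score(embedding):
    """Illustrative stand-in (not SUBTLE's actual metric): the ratio of
    the mean distance between random frame pairs to the mean distance
    between temporally adjacent frames. Higher means adjacent frames
    sit closer together in the embedding space."""
    adjacent = np.linalg.norm(np.diff(embedding, axis=0), axis=1).mean()
    rng = np.random.default_rng(0)
    i = rng.integers(0, len(embedding), 1000)
    j = rng.integers(0, len(embedding), 1000)
    random_pairs = np.linalg.norm(embedding[i] - embedding[j], axis=1).mean()
    return random_pairs / (adjacent + 1e-8)

# A smooth trajectory (temporally adjacent points close together)
# should score higher than a shuffled version of the same points.
t = np.linspace(0, 4 * np.pi, 500)
smooth = np.stack([np.cos(t), np.sin(t)], axis=1)
shuffled = np.random.default_rng(1).permutation(smooth)
print(temporal_proximity_score(smooth) > temporal_proximity_score(shuffled))  # True
```

The appeal of a metric like this is that it lets you compare embedding hyperparameters quantitatively, rather than eyeballing UMAP plots.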

Behavioral monitoring

A pipeline to detect abnormal gait from video clips in Florida panthers and Bobcats, from Wasmuht et al, 2023

Another aspect of behavior that I saw was the idea of monitoring behavior in the wild using computer vision. This promises to scale up the analysis of camera trap footage and perhaps speed up interventions to help wildlife.

Detecting a gait disorder in wild cats stood out to me, as it seemed like a particularly cool and well-motivated application of computer vision. Help those poor cats!

Multi-animal behavior

Finally, I also saw a few posters focused on modeling behavior across multiple animals. A few worked directly from videos, which I thought was particularly interesting. This does feel like a sub-field to watch in the next few years. Animal behavior in the wild often naturally includes interactions between multiple animals, so having a way to quantify them would push the frontier of natural behavior analysis.

Animal wellness

Rabbit pain detector from Feighelstein et al, 2023. (Above) A pipeline for detecting pain in rabbits from videos. The detector works frame by frame. (Below) Examples of frames of rabbits that have no pain versus rabbits in pain.

One topic that stood out to me was animal wellness, specifically detecting pain of animals from images and videos. I hadn’t really thought to apply computer vision for this purpose before.

Featured as a keynote speaker was Albert Ali Salah, who presented his work on detecting horse pain from images and dog pain from videos.

There were also a few posters from Anna Zamansky’s group:

I think this is actually quite a promising area for computer vision applications. For one, using the standard scales for rating pain in animals seems quite time-consuming for animal care specialists. Second, there might be better signals of pain that we are currently missing, which a computer vision algorithm could pick up on.

The ground truth can be somewhat messy though. Here, it is based either on whether an animal has undergone an operation or on human expert annotations. It’s definitely not perfect, so even the best algorithm has a limit on its accuracy…


  1. I think this was actually presented at CV4Animals 2022, but I happened to come across it while trying to find the “AnimalTracks-17” dataset and figured I’d include it here.↩︎

  2. I couldn’t find this dataset online or a paper referencing it. I do vividly remember this poster being presented at CV4Animals though. “Tracks” here references footprints. The first author is Risa Shinoda.↩︎

  3. I couldn’t find a link for this poster. There doesn’t seem to be a publication with this title either. The first author is Luke Meyers, if you’re looking at this and it has been published somewhere in the meantime.↩︎

  4. I do plan to add 3D multi-animal support in Anipose over the next year. Given that the authors behind DANNCE released a multi-animal dataset, I wonder if they’re planning for multi-animal support soon as well.↩︎

  5. Technically, this was presented at the multi-agent behavior workshop, but still at CVPR2023. I don’t have an image for this one, but the authors are Daiyao Yi, Elizabeth S. Wright, Nancy Padilla-Coreano, Shreya Saxena (all at University of Florida). They found a way to represent an embedding for each frame in a video that captures social behavior.↩︎

  6. see footnote 5 above↩︎


BibTeX citation:
@online{karashchuk2023,
  author = {Lili Karashchuk},
  editor = {},
  title = {CV4Animals 2023: {The} State of the Art in Quantifying Animal
    Movement and Behavior},
  date = {2023-06-19},
  url = {},
  langid = {en}
}
For attribution, please cite this work as:
Lili Karashchuk. 2023. “CV4Animals 2023: The State of the Art in Quantifying Animal Movement and Behavior.” June 19, 2023.