Advances in computer vision as well as increasingly widespread video-based behavioral monitoring have great potential for transforming how we study animal cognition and behavior. However, there is still a fairly large gap between the exciting prospects and what can actually be achieved in practice today, especially in videos from the wild. With this perspective paper, we want to contribute towards closing this gap, by guiding behavioral scientists in what can be expected from current methods and steering computer vision researchers towards problems that are relevant to advance research in animal behavior. We start with a survey of the state-of-the-art methods for computer vision problems that are directly relevant to the video-based study of animal behavior, including object detection, multi-individual tracking, (inter)action recognition and individual identification. We then review methods for effort-efficient learning, which is one of the biggest challenges from a practical perspective. Finally, we close with an outlook into the future of the emerging field of computer vision for animal behavior, where we argue that the field should move fast beyond the common frame-by-frame processing and treat video as a first-class citizen.
The diverse nature of visual environments demands that the retina, the first stage of the visual system, encodes a vast range of stimuli with various statistics. The retina adapts its computations to some specific features of the input, such as brightness, contrast or motion. However, it is less clear whether it also adapts to the statistics of natural scenes compared to white noise, the latter of which is often used to infer models of retinal computation. To address this question, we analyzed neural activity of retinal ganglion cells (RGCs) in response to both white noise and naturalistic movie stimuli. We performed a systematic comparative analysis of traditional linear-nonlinear (LN) and recent convolutional neural network (CNN) models and tested their generalization across stimulus domains. We found that no model type trained on one stimulus ensemble was able to accurately predict neural activity on the other, suggesting that retinal processing depends on the stimulus statistics. Under white noise stimulation, the receptive fields of the neurons were mostly lowpass, while under natural image statistics they exhibited a more pronounced surround resembling the whitening filters predicted by efficient coding. Together, these results suggest that retinal processing dynamically adapts to the stimulus statistics.
Memorization in large language models (LLMs) is a growing concern. LLMs have been shown to easily reproduce parts of their training data, including copyrighted work. This is an important problem to solve, as it may violate existing copyright laws as well as the European AI Act. In this work, we propose a systematic analysis to quantify the extent of potential copyright infringements in LLMs using European law as an example. Unlike previous work, we evaluate instruction-finetuned models in a realistic end-user scenario. Our analysis builds on a proposed threshold of 160 characters, which we borrow from the German Copyright Service Provider Act, and on a fuzzy text matching algorithm to identify potentially copyright-infringing textual reproductions. The specificity of countermeasures against copyright infringement is analyzed by comparing model behavior on copyrighted and public domain data. We investigate what behaviors models show instead of producing protected text (such as refusal or hallucination) and provide a first legal assessment of these behaviors. We find that there are huge differences in copyright compliance, specificity, and appropriate refusal among popular LLMs. Alpaca, GPT 4, GPT 3.5, and Luminous perform best in our comparison, with OpenGPT-X, Alpaca, and Luminous producing a particularly low absolute number of potential copyright violations. Code will be published soon.
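The 160-character criterion described above can be illustrated with a minimal sketch. It is not the matching algorithm from the paper: the fuzzy matcher is replaced here by an exact longest-common-substring search from Python's standard `difflib`. The function names and the choice to flag only overlaps longer than 160 characters are assumptions made for illustration.

```python
from difflib import SequenceMatcher

THRESHOLD_CHARS = 160  # threshold named in the abstract


def longest_verbatim_overlap(model_output: str, protected_text: str) -> int:
    """Length of the longest exact substring shared by the two texts."""
    # autojunk=False disables difflib's popularity heuristic, which would
    # otherwise ignore frequently occurring characters in long strings.
    matcher = SequenceMatcher(None, model_output, protected_text, autojunk=False)
    match = matcher.find_longest_match(0, len(model_output), 0, len(protected_text))
    return match.size


def is_potential_reproduction(model_output: str, protected_text: str) -> bool:
    """Flag outputs whose longest verbatim overlap exceeds the threshold.

    Assumption: an overlap of exactly 160 characters is not flagged.
    """
    return longest_verbatim_overlap(model_output, protected_text) > THRESHOLD_CHARS
```

A fuzzy matcher would additionally tolerate small edits such as changed punctuation or whitespace, so this exact variant gives a lower bound on what such a matcher would flag.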
Forecasting long-term 3D human motion is challenging: the stochasticity of human behavior makes it hard to generate realistic human motion from the input sequence alone. Information on the scene environment and the motion of nearby people can greatly aid the generation process. We propose a scene-aware social transformer model (SAST) to forecast long-term (10s) human motion. Unlike previous models, our approach can model interactions between both widely varying numbers of people and objects in a scene. We combine a temporal convolutional encoder-decoder architecture with a Transformer-based bottleneck that allows us to efficiently combine motion and scene information. We model the conditional motion distribution using denoising diffusion models. We benchmark our approach on the Humans in Kitchens dataset, which contains 1 to 16 persons and 29 to 50 objects that are visible simultaneously. Our model outperforms other approaches in terms of realism and diversity on different metrics and in a user study.
Retinal ganglion cells, the output neurons of the vertebrate retina, often display nonlinear summation of visual signals over their receptive fields. This creates sensitivity to spatial contrast, letting the cells respond to spatially structured visual stimuli, such as a contrast-reversing grating, even when no net change in overall illumination of the receptive field occurs. Yet, computational models of ganglion cell responses are often based on linear receptive fields. Nonlinear extensions, on the other hand, such as subunit models, which separate receptive fields into smaller, nonlinearly combined subfields, are often cumbersome to fit to experimental data, in particular when natural stimuli are considered. Previous work in the salamander retina has shown that sensitivity to spatial contrast in response to flashed images can be partly captured by a model that combines signals from the mean and variance of luminance signals inside the receptive field. Here, we extend this spatial contrast model for application to spatiotemporal stimulation and explore its performance on spiking responses that we recorded from retinas of marmosets under artificial and natural movies. We show how the model can be fitted to experimental data and that it outperforms common models with linear spatial integration, in particular for parasol ganglion cells. Finally, we use the model framework to infer the cells' spatial scale of nonlinear spatial integration and contrast sensitivity. Our work shows that the spatial contrast model provides a simple approach to capturing aspects of nonlinear spatial integration with only few free parameters, which can be used to assess the cells' functional properties under natural stimulation and which provides a simple-to-obtain benchmark for comparison with more detailed nonlinear encoding models.
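The core idea of the spatial contrast model can be sketched for a single stimulus frame: the cell's drive combines the receptive-field-weighted mean luminance with a measure of luminance spread inside the receptive field. In this sketch, the weighted standard deviation as the contrast term, the weight `w`, and the softplus output nonlinearity are illustrative assumptions, not the fitted model from the paper.

```python
import math


def spatial_contrast_response(pixels, rf_weights, w=1.0):
    """Toy firing-rate estimate from mean and spread of luminance in the RF.

    pixels     -- contrast values of the stimulus pixels inside the RF
    rf_weights -- non-negative receptive-field weights, one per pixel
    w          -- hypothetical weight of the spatial-contrast term
    """
    total = sum(rf_weights)
    norm = [r / total for r in rf_weights]  # weights now sum to 1
    mean = sum(n * p for n, p in zip(norm, pixels))
    variance = sum(n * (p - mean) ** 2 for n, p in zip(norm, pixels))
    drive = mean + w * math.sqrt(variance)
    return math.log1p(math.exp(drive))  # softplus keeps the rate non-negative


# A uniform gray patch and a grating with the same (zero) mean luminance.
uniform = spatial_contrast_response([0.0, 0.0, 0.0, 0.0], [1, 1, 1, 1])
grating = spatial_contrast_response([1.0, -1.0, 1.0, -1.0], [1, 1, 1, 1])
```

The grating has zero mean luminance yet drives a larger response than the uniform patch. Setting `w=0` reduces the sketch to linear spatial integration, for which both stimuli give the same response.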
In crop protection, disease quantification parameters such as disease incidence (DI) and disease severity (DS) are the principal indicators for decision making, aimed at ensuring the safety and productivity of crop yield. The quantification is standardized with leaf organs, defined as individual scoring units. This study focuses on identifying and segmenting individual leaves in agricultural fields using multispectral unmanned aerial vehicle (UAV) imagery of sugar beet fields and deep instance segmentation networks (Mask R-CNN). Five strategies for achieving network robustness with limited labeled images are tested and compared, employing simple and copy-paste image augmentation techniques. The study also evaluates the impact of environmental conditions on network performance. Metrics of performance show that multispectral UAV images recorded under sunny conditions lead to a performance drop. Focusing on the practical application, we employ Mask R-CNN models in an image-processing pipeline to calculate leaf-based parameters including DS and DI. The pipeline was applied in time-series in an experimental trial with five varieties and two fungicide strategies to illustrate epidemiological development. Disease severity calculated with the model with highest Average Precision (AP) shows the strongest correlation with the same parameter assessed by experts. The time-series development of disease severity and disease incidence demonstrates the advantages of multispectral UAV-imagery in contrasting varieties for resistance, as well as the limits for disease control measurements. This study identifies key components for automatic leaf segmentation of diseased plants using UAV imagery, such as illumination and disease condition. It also provides a tool for delivering leaf-based parameters relevant for optimizing crop production through automated disease quantification by imaging tools.
Unsupervised graph representation learning has recently gained interest in several application domains such as neuroscience, where modeling the diverse morphology of cell types in the brain is one of the key challenges. It is currently unknown how many excitatory cortical cell types exist and what their defining morphological features are. Here we present GraphDINO, a purely data-driven approach to learn low-dimensional representations of 3D neuronal morphologies from unlabeled large-scale datasets. GraphDINO is a novel transformer-based representation learning method for spatially-embedded graphs. To enable self-supervised learning on transformers, we (1) developed data augmentation strategies for spatially-embedded graphs, (2) adapted the positional encoding and (3) introduced a novel attention mechanism, AC-Attention, which combines attention-based global interaction between nodes and classic graph convolutional processing. We show, in two different species and across multiple brain areas, that this method yields morphological cell type clusterings that are on par with manual feature-based classification by experts, but without using prior knowledge about the structural features of neurons. Moreover, it outperforms previous approaches on quantitative benchmarks predicting expert labels. Our method could potentially enable data-driven discovery of novel morphological features and cell types in large-scale datasets. It is applicable beyond neuroscience in settings where samples in a dataset are graphs and graph-level embeddings are desired.
Understanding how biological visual systems process information is challenging due to the complex nonlinear relationship between neuronal responses and high-dimensional visual input. Artificial neural networks have already improved our understanding of this system by allowing computational neuroscientists to create predictive models and bridge biological and machine vision. During the Sensorium 2022 competition, we introduced benchmarks for vision models with static input. However, animals operate and excel in dynamic environments, making it crucial to study and understand how the brain functions under these conditions. Moreover, many biological theories, such as predictive coding, suggest that previous input is crucial for current input processing. Currently, there is no standardized benchmark to identify state-of-the-art dynamic models of the mouse visual system. To address this gap, we propose the Sensorium 2023 Competition with dynamic input. This includes the collection of a new large-scale dataset from the primary visual cortex of five mice, containing responses from over 38,000 neurons to over 2 hours of dynamic stimuli per neuron. Participants in the main benchmark track will compete to identify the best predictive models of neuronal responses for dynamic input. We will also host a bonus track in which submission performance will be evaluated on out-of-domain input, using withheld neuronal responses to dynamic input stimuli whose statistics differ from the training set. Both tracks will offer behavioral data along with video stimuli. As before, we will provide code, tutorials, and strong pre-trained baseline models to encourage participation. We hope this competition will continue to strengthen the accompanying Sensorium benchmarks collection as a standard tool to measure progress in large-scale neural system identification models of the entire mouse visual hierarchy and beyond.
Neurons in the neocortex exhibit astonishing morphological diversity which is critical for properly wiring neural circuits and giving neurons their functional properties. The extent to which the morphological diversity of excitatory neurons forms a continuum or is built from distinct clusters of cell types remains an open question. Here we took a data-driven approach using graph-based machine learning methods to obtain a low-dimensional morphological bar code describing more than 30,000 excit
The neural underpinning of the biological visual system is challenging to study experimentally, in particular as neuronal activity becomes increasingly nonlinear with respect to visual input. Artificial neural networks (ANNs) can serve a variety of goals for improving our understanding of this complex system, not only serving as predictive digital twins of sensory cortex for novel hypothesis generation in silico, but also incorporating bio-inspired architectural motifs to progressively bridge the gap between biological and machine vision. The mouse has recently emerged as a popular model system to study visual information processing, but no standardized large-scale benchmark to identify state-of-the-art models of the mouse visual system has been established. To fill this gap, we proposed the SENSORIUM benchmark competition. We collected a large-scale dataset from mouse primary visual cortex containing the responses of more than 28,000 neurons across seven mice stimulated with thousands of natural images, together with simultaneous behavioral measurements that include running speed, pupil dilation, and eye movements. The benchmark challenge ranked models based on predictive performance for neuronal responses on a held-out test set, and included two tracks for model input limited to either stimulus only (SENSORIUM) or stimulus plus behavior (SENSORIUM+). As a part of the NeurIPS 2022 competition track, we received 172 model submissions from 26 teams, with the winning teams improving our previous state-of-the-art model by more than 15%. Dataset access and infrastructure for evaluation of model predictions will remain online as an ongoing benchmark. We would like to see this as a starting point for regular challenges and data releases, and as a standard tool for measuring progress in large-scale neural system identification models of the mouse visual system and beyond.
The neural underpinning of the biological visual system is challenging to study experimentally, in particular as the neuronal activity becomes increasingly nonlinear with respect to visual input. Artificial neural networks (ANNs) can serve a variety of goals for improving our understanding of this complex system, not only serving as predictive digital twins of sensory cortex for novel hypothesis generation in silico, but also incorporating bio-inspired architectural motifs to progressively bridge the gap between biological and machine vision. The mouse has recently emerged as a popular model system to study visual information processing, but no standardized large-scale benchmark to identify state-of-the-art models of the mouse visual system has been established. To fill this gap, we propose the Sensorium benchmark competition. We collected a large-scale dataset from mouse primary visual cortex containing the responses of more than 28,000 neurons across seven mice stimulated with thousands of natural images, together with simultaneous behavioral measurements that include running speed, pupil dilation, and eye movements. The benchmark challenge will rank models based on predictive performance for neuronal responses on a held-out test set, and includes two tracks for model input limited to either stimulus only (Sensorium) or stimulus plus behavior (Sensorium+). We provide a starting kit to lower the barrier for entry, including tutorials, pre-trained baseline models, and APIs with one-line commands for data loading and submission. We would like to see this as a starting point for regular challenges and data releases, and as a standard tool for measuring progress in large-scale neural system identification models of the mouse visual system and beyond.
Divisive normalization (DN) is a prominent computational building block in the brain that has been proposed as a canonical cortical operation. Numerous experimental studies have verified its importance for capturing nonlinear neural response properties to simple, artificial stimuli, and computational studies suggest that DN is also an important component for processing natural stimuli. However, we lack quantitative models of DN that are directly informed by measurements of spiking responses in the brain and applicable to arbitrary stimuli. Here, we propose a DN model that is applicable to arbitrary input images. We test its ability to predict how neurons in macaque primary visual cortex (V1) respond to natural images, with a focus on nonlinear response properties within the classical receptive field. Our model consists of one layer of subunits followed by learned orientation-specific DN. It outperforms linear-nonlinear and wavelet-based feature representations and makes a significant step towards the performance of state-of-the-art convolutional neural network (CNN) models. Unlike deep CNNs, our compact DN model offers a direct interpretation of the nature of normalization. By inspecting the learned normalization pool of our model, we gained insights into a long-standing question about the tuning properties of DN that update the current textbook description: we found that within the receptive field oriented features were normalized preferentially by features with similar orientation rather than non-specifically as currently assumed.
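The operation at the heart of divisive normalization can be sketched in its generic textbook form: each feature's drive is divided by a constant plus a weighted sum over a pool of features. This is not the architecture from the paper. The function name, the parameters `sigma` and `exponent`, and the example weight matrices are hypothetical and chosen only to contrast an orientation-specific pool with a non-specific one.

```python
def divisive_normalization(drives, pool_weights, sigma=1.0, exponent=2.0):
    """Normalize each non-negative feature drive by a weighted pool of drives.

    drives       -- non-negative subunit outputs, one per feature
                    (for example, one per orientation)
    pool_weights -- pool_weights[i][j] is how strongly feature j
                    normalizes feature i
    """
    powered = [d ** exponent for d in drives]
    normalized = []
    for i, numerator in enumerate(powered):
        pool = sum(w * p for w, p in zip(pool_weights[i], powered))
        normalized.append(numerator / (sigma ** exponent + pool))
    return normalized


drives = [2.0, 1.0]
# Each feature is normalized only by itself (orientation-specific pool).
specific = divisive_normalization(drives, [[1.0, 0.0], [0.0, 1.0]])
# Each feature is normalized equally by all features (non-specific pool).
nonspecific = divisive_normalization(drives, [[0.5, 0.5], [0.5, 0.5]])
```

With the same drives, the two pools give different normalized responses, which is why inspecting a learned pool can reveal which features normalize which.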
Understanding the diversity of cell types and their function in the brain is one of the key challenges in neuroscience. The advent of large-scale datasets has given rise to the need for unbiased and quantitative approaches to cell type classification. We present GraphDINO, a purely data-driven approach to learning a low-dimensional representation of the 3D morphology of neurons. GraphDINO is a novel graph representation learning method for spatial graphs utilizing self-supervised learning on transformer models. It smoothly interpolates between attention-based global interaction between nodes and classic graph convolutional processing. We show that this method is able to yield morphological cell type clustering that is comparable to manual feature-based classification and shows a good correspondence to expert-labeled cell types in two different species and cortical areas. Our method is applicable beyond neuroscience in settings where samples in a dataset are graphs and graph-level embeddings are desired.
Neural Data Science Group
Institute of Computer Science
University of Goettingen