ACIVS 2025 – Program Abstracts

Paper 4 — Robust Road Surface Normal and Pitch Prediction via IMU-Camera Fusion

Abstract: Predicting road surface normal and pitch with image-based algorithms remains a significant challenge, especially when steep inclines, declines, and sudden changes in road inclination are involved. To solve this problem, we propose a novel image-based algorithm that leverages homography decomposition to achieve accurate ground plane normal and pitch estimation. By integrating a Kalman Filter, our method enhances prediction stability. Further, we incorporate robust sensor fusion by integrating IMU-based odometry, ensuring that the estimates are accurately aligned with real-world motion. Overall, our approach outperforms existing techniques in both accuracy and responsiveness for dynamic driving environments. Experimental results show that our approach achieves superior performance, reducing average pitch and normal errors by 0.493° and 0.483°, respectively, compared to the current state-of-the-art, and exhibits a shorter transient response in the case of sudden road inclination changes. Code is available at: https://norbertmarko.github.io/ground-normal-prediction/.
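The Kalman-filter stabilization step mentioned in the abstract can be illustrated with a generic 1D sketch. This is not the authors' implementation: the constant-state model, the noise parameters `q` and `r`, and the function name are illustrative assumptions.

```python
import numpy as np

def kalman_smooth_pitch(measurements, q=1e-4, r=0.05):
    """Smooth a sequence of noisy pitch estimates with a 1D
    constant-state Kalman filter.  q: process noise variance,
    r: measurement noise variance (both illustrative values)."""
    x, p = measurements[0], 1.0          # initial state and covariance
    out = []
    for z in measurements:
        p = p + q                        # predict: state assumed constant
        k = p / (p + r)                  # Kalman gain
        x = x + k * (z - x)              # update with measurement z
        p = (1.0 - k) * p
        out.append(x)
    return np.array(out)

# Noisy measurements around a true pitch of 0.05 rad
rng = np.random.default_rng(0)
true_pitch = 0.05
z = true_pitch + 0.02 * rng.standard_normal(200)
est = kalman_smooth_pitch(z)
```

After a short transient, `est` varies far less than the raw measurements while staying centered on the true pitch; the `q`/`r` ratio trades responsiveness to real inclination changes against smoothing strength.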

Authors: Norbert Marko (HUN-REN SZTAKI)*; Zoltan Rozsa (HUN-REN SZTAKI); Aron Ballagi (Szechenyi Istvan University); Tamas Sziranyi (HUN-REN SZTAKI)


Paper 5 — SinoDAM: A Volumetric Sinogram-Based Methodology for Realistic Dataset Augmentation in Additive Manufacturing

Abstract: Additive manufacturing (AM) enables the production of complex geometries but presents challenges related to defect formation, especially in powder bed fusion technologies where defects like porosities are common. Efficient and accurate defect identification in metallic parts is critical but often time-consuming, error-prone, and costly. X-ray computed tomography (XCT) is widely used for non-destructive defect detection, but it suffers from high computational demands, numerical artifacts, and reduced spatial resolution due to the reconstruction process. Deep learning defect inspection requires large datasets, but only small datasets are currently available. To address this challenge, a previously published method introduced a 2D defect extractor operating directly in the Radon space. While effective, this approach is limited by its slice-by-slice processing. To overcome this, we propose the Sinogram Defect Augmentation Methodology (SinoDAM), which refines the previous workflow into a 3D approach. SinoDAM generates volumetric variations commonly observed in industrial conditions by incorporating geometric transformations. Operating directly in the sinogram domain ensures the physical plausibility of synthetic data while avoiding inconsistencies introduced during reconstruction. SinoDAM offers an effective solution for synthesizing diverse and realistic datasets in the Radon space. It is designed to support future deep learning applications in defect detection and classification, offering a more scalable and accurate solution for quality control in AM.

Authors: Nina Lassalle-Astis (CETIM Sud-Ouest)*; Pascal Desbarats (LaBRI); Fabien Baldacci (LaBRI); Romain Brault (CETIM)


Paper 6 — Towards Optimizing Swarm Drone Delivery in RF-Denied Environments

Abstract: Swarm drone delivery has the potential to enhance reliability and scalability, but challenges in communication, coordination, redundancy, and handling complex scenarios still exist. This paper introduces an infrared (IR)-based Leader-Follower drone coordination system with load-sensing for low-light and RF-denied environments. The system employs a hierarchical tree network topology, where the leader drone transmits pose data via IR, and follower drones, equipped with NoIR cameras, process IR light patterns in real-time using convolutional neural networks (CNNs) and the Fourier Transform. IR detection achieved 87% accuracy at 0.7 m in normal light and 95% at 1 m in low light. Additionally, this work presents string pose estimation (SPE) and self-balancing tray (SBT) models for balanced delivery. SPE uses symmetrical tethers to manage payload swing, real-time load adjustment, and even-load distribution, with a precision of 0.87, recall of 1.0, and F1 score of 0.93. SBT dynamically adjusts elastic tethers for minor imbalances. The system demonstrated considerable flight stability, maintaining minimal deviations in roll, pitch, and yaw, thus ensuring smooth and controlled drone movements. These innovations enhance drone stability and accuracy, making the system robust for challenging delivery environments.

Authors: Endrowednes Kuantama (Macquarie University)*; Alice James (Macquarie University); Avishkar Seth (Macquarie University)


Paper 7 — Active Deep Clustering: Exploratory Analysis to Assist in Decision-Making on Incremental Label Morphing Datasets

Abstract: Many supervised training schemes rely on a single associated label for each data sample within a set. Where the goal is to learn different levels of granularity of the data implicitly, neural networks accomplish this with different layers that extract features at distinct levels. In this work, we explore more explicit labelling structures in which each sample has multiple labels forming relationships at both an abstract and a fine-grained level, producing a tree for each associated data sample. This novel training scheme utilises a refinement strategy based on deep clustering approaches to detect the colliding and splitting of clusters, where each cluster is assigned a label. Experts can then be queried to determine whether colliding clusters should belong to a single label or, alternatively, whether splitting clusters should give rise to new labels, forming the active component of our training scheme. Colliding clusters form a parent label, while splitting clusters form sibling labels within the tree structure. By utilizing a tree data structure to represent labels at different levels of granularity, we can invoke explicitly defined relationships and dependencies to form a more structured and interpretable representation of data. Instead of treating data as flat, homogeneous sets, we allow for the exploitation of hierarchical relationships and leverage inherent structure to improve data efficiency. We present a case study of the approach applied within the steel manufacturing domain, where quality control remains an active challenge due to morphing labels as products move down the production line.

Authors: Connor Clarkson (Swansea University)*; Michael Edwards (Swansea University); Xianghua Xie (Swansea University)


Paper 8 — Gait Recognition via Pristine Feature Learning

Abstract: Gait recognition using 3D point clouds offers resilience to appearance variations but faces challenges in modeling sparse, noisy, and temporally misaligned sequences. We propose the Pristine Learning Framework, designed to extract noise-invariant gait representations from 3D point cloud data. At its core, a (2+1)D convolution network separates spatial and temporal feature learning, reducing model complexity while improving resilience under limited data. A key innovation is the Pristine Feature Vector, a dynamically refined representation of the intrinsic gait cycle learned by fusing features from N subjects, mitigating individual biases and environmental noise. This vector is continuously updated through a self-attention mechanism that enhances discriminative biomechanical regions and critical gait phases while suppressing transient noise. Additionally, a squeeze-and-excitation module aligns gait features with the pristine vector, further improving generalization across subjects. Experimental evaluations on benchmark datasets demonstrate state-of-the-art accuracy, with a 5% improvement in Rank-1 accuracy over existing methods. This work advances the feasibility of deployable, privacy-preserving gait recognition systems by addressing critical challenges in 3D spatiotemporal modeling and noise-invariant feature learning.

Authors: Anuj Rathore (Fujitsu Research)*; Daksh Thapar (Fujitsu Research); Mahesh Chandran (Fujitsu Research)


Paper 10 — FCR-PoseHRNet: Flexible Feature Realignment and Cross-Resolution Coordinate Refinement in PoseHRNet for 2D Human Pose Estimation

Abstract: Human Pose Estimation (HPE) remains challenging due to variations in body scales, frequent occlusions, and high computational requirements. Although heatmap-based methods achieve strong accuracy, they typically rely on large upsampling operations, leading to significant memory usage and complicating deployment in real-time or resource-constrained devices. In this paper, we propose FCR-PoseHRNet (Flexible Feature Realignment and Cross-Resolution Coordinate Refinement), a coordinate-based framework that addresses these issues. Our approach modifies the HRNet backbone by replacing standard convolutions with depthwise separable variants, creating an efficient architecture that preserves strong representational power. To handle complex poses and occluded keypoints, we introduce Adaptive Feature Realignment (AFR) to globally adjust feature distributions, alongside a Cross-Resolution Channel Attention (CRCA) module for effective multi-scale feature fusion. Finally, a Fractal Coordinate Refinement (FCR) procedure iteratively refines joint coordinates, avoiding the need for large heatmap predictions. Evaluations on the COCO, CrowdPose, and MPII benchmarks show that FCR-PoseHRNet achieves competitive pose accuracy compared to state-of-the-art methods while significantly reducing computational overhead, confirming its suitability for real-time applications under challenging conditions.

Authors: Zakir Ali (The University of Electro-Communications)*; Benitez-Garcia Gibran (The University of Electro-Communications); Hiroki Takahashi (The University of Electro-Communications)


Paper 11 — Correction of the Jitter Effect in Pléiades Satellite Elevation Data for Enhanced 3D Change Monitoring

Abstract: Satellite jitter, observable in the digital elevation model of difference (herein referred to as DoD) generated from tri-stereo pairs of Pléiades images, is a common phenomenon that can reduce the precision of elevation measurements, making 3D change detection on the land surface challenging. To address this issue, previous studies used correction methods based on polynomial and sinusoid fitting to reduce the jitter effect. However, such approaches do not entirely remove these undulations. In this paper, the noise reduction task is framed as a signal filtering problem based on the Fourier transform, eliminating these repetitive patterns while preserving the information relevant for accurate change detection. The noise frequency is identified in the frequency spectrum using a threshold-based method, allowing the subsequent application of a finite impulse response filter to remove the noise. Experiments conducted on Pléiades DoD data covering the Lebna watershed in the north of Tunisia validate the effectiveness of this approach. The results demonstrate that the Fourier transform-based filtering method significantly outperforms state-of-the-art methods, both in terms of qualitative visual assessment and quantitative performance metrics.
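The threshold-based frequency identification and FIR filtering the abstract describes can be sketched on a 1D profile. This is a minimal generic illustration, not the authors' pipeline: the 4x-mean spectral threshold, the band half-width, the filter length, and the linear detrending step are all illustrative assumptions.

```python
import numpy as np
from scipy.signal import firwin, filtfilt

def remove_jitter(profile, fs=1.0, threshold=4.0, half_width=0.03):
    """Suppress a periodic jitter pattern in a 1D elevation profile.

    The dominant noise frequency is found by thresholding the magnitude
    spectrum of the detrended profile; a zero-phase FIR band-stop filter
    then removes it.  Parameter values are illustrative."""
    x = np.arange(len(profile))
    trend = np.polyval(np.polyfit(x, profile, 1), x)   # remove linear trend
    resid = profile - trend
    spectrum = np.abs(np.fft.rfft(resid))
    freqs = np.fft.rfftfreq(len(profile), d=1.0 / fs)
    mask = spectrum > threshold * spectrum.mean()      # threshold-based pick
    if not mask.any():
        return profile                                 # no jitter detected
    f0 = freqs[mask][np.argmax(spectrum[mask])]        # dominant jitter freq
    lo, hi = max(f0 - half_width, 1e-6), min(f0 + half_width, fs / 2 - 1e-6)
    taps = firwin(201, [lo, hi], pass_zero="bandstop", fs=fs)
    return trend + filtfilt(taps, [1.0], resid)        # zero-phase filtering

# Synthetic profile: linear slope plus sinusoidal jitter at 0.1 cycles/sample
x = np.arange(2000)
clean = 0.01 * x
noisy = clean + 0.5 * np.sin(2 * np.pi * 0.1 * x)
filtered = remove_jitter(noisy)
```

Using `filtfilt` (forward-backward filtering) avoids the phase shift a single FIR pass would introduce, so the corrected profile stays spatially aligned with the original, which matters when the result feeds a change-detection comparison.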

Authors: Imen Brini (IRD)*


Paper 14 — ICE-Cubed: Inpainting of Cinematographic Elements for Intelligent Context Expansion

Abstract: Immersive display systems like ICE Theaters enhance viewer experience by extending the field of view. However, maintaining immersion is challenging due to peripheral vision sensitivity. Elements near screen edges can disrupt the experience and are difficult to detect, as they lack semantic meaning and are highly context-dependent. We propose a two-step framework to address this issue. First, we benchmark state-of-the-art segmentation models on our application; the best one, BiRefNet, is retrained on a custom dataset to improve detection. To complete this detection phase, a morphological dilation operation enhances mask coverage, followed by a filter that temporally stabilizes detection. Second, we use E2FGVI for video inpainting, ensuring smooth content extension, then RealBasicVSR for super-resolution. To validate our approach, detection masks are assessed through expert evaluation and state-of-the-art mask quality metrics. Our results show superior performance over existing methods, both quantitatively and qualitatively. Finally, a perceptual study conducted with 5 ICE experts across 29 cinematic contents confirms our framework’s ability to enhance immersive cinema experiences.

Authors: David Traparic (ISAE-ENSMA)*; Maugan De Murcia (Xlim CNRS UMR 7252); Mohamed-Chaker Larabi (Xlim CNRS UMR 7252); Ladjel Bellatreche (ISAE-ENSMA)


Paper 16 — Task-oriented Robotic Manipulation with Vision Language Models

Abstract: Vision-Language Models (VLMs) play a crucial role in robotic manipulation by enabling robots to understand and interpret the visual properties of objects and their surroundings, allowing them to perform manipulation based on this multimodal understanding. However, understanding object attributes and spatial relationships is non-trivial, yet critical for robotic manipulation tasks. In this work, we present a new dataset focused on spatial relationships and a novel method to utilize VLMs to perform object manipulation with task-oriented, high-level input. In this dataset, the spatial relationships between objects are manually described as captions. Additionally, each object is labeled with multiple attributes, such as fragility, mass, material, and transparency, derived from a fine-tuned vision language model. The embedded object information from captions is automatically extracted and transformed into a data structure (in this case, a tree, for demonstration purposes) that captures the spatial relationships among the objects within each image. The tree structures, along with the object attributes, are then fed into a language model to produce a new tree structure that determines how these objects should be organized in order to accomplish a specific (high-level) task. We demonstrate that our method not only improves the comprehension of spatial relationships among objects in the visual environment but also enables robots to interact with these objects more effectively. As a result, this approach significantly enhances spatial reasoning in robotic manipulation tasks. To our knowledge, this is the first method of its kind in the literature, offering a novel solution that allows robots to more efficiently organize and utilize objects in their surroundings.

Authors: Nurhan Bulus-Guran (Swansea University); Hanchi Ren (Swansea University); Jingjing Deng (Durham University); Xianghua Xie (Swansea University)*


Paper 20 — Scoring-based Copy-Paste for Augmenting Crowded Pedestrian

Abstract: Detecting pedestrians in crowded scenes is a challenging task due to severe occlusions and complex object interactions. To improve data augmentation for this setting, we propose a novel Scoring-based Copy-Paste Pipeline (SCPP). Unlike previous methods that randomly paste objects near each other, SCPP introduces a principled three-stage augmentation pipeline that incorporates patch scoring and flexible object placement. We design a new metric called Neighbor-Weighted IoU to more effectively capture the complexity of object overlaps. Our approach ensures more diverse occlusion patterns, as well as better crowdedness consistency across pasted patches. We conduct experiments on the CrowdHuman dataset using mainstream detectors, showing that SCPP consistently outperforms existing targeted copy-paste techniques on standard evaluation metrics. Notably, these improvements are achieved without additional training data or complex post-processing, demonstrating the potential of our approach for enhancing pedestrian detection performance in crowded scenarios.

Authors: JUI-CHE CHANG (National Tsing Hua University)*


Paper 24 — Contrastive Learning through Auxiliary Branch for Video Object Detection

Abstract: Video object detection is a challenging task because videos often suffer from image deterioration such as motion blur, occlusion, and deformable shapes, making it significantly more difficult than detecting objects in still images. Prior approaches have improved video object detection performance by employing feature aggregation and complex post-processing techniques, though at the cost of increased computational demands. To improve robustness to image degradation without additional computational load during inference, we introduce a straightforward yet effective Contrastive Learning through Auxiliary Branch (CLAB) method. First, we implement a contrastive auxiliary branch using a contrastive loss to enhance the feature representation capability of the video object detector’s backbone. Next, we propose a dynamic loss weighting strategy that emphasizes auxiliary feature learning early in training while gradually prioritizing the detection task as training converges. We validate our approach through comprehensive experiments and ablation studies, demonstrating consistent performance gains. Without bells and whistles, CLAB reaches a performance of 84.0% mAP and 85.2% mAP with ResNet-101 and ResNeXt-101, respectively, on the ImageNet VID dataset, thus achieving state-of-the-art performance for CNN-based models without requiring additional post-processing methods.

Authors: Lucas RAKOTOARIVONY (Thales Research & Technology)*


Paper 26 — Pretraining Techniques for Ra Prediction with Long Thin Spatial Industrial Data

Abstract: Machine learning offers promising advancements in industrial processes, yet collecting labeled samples during production remains challenging. In steel production, the surface roughness Ra parameter of steel coils is crucial, but on-line labeled data collection, with our apparatus, is infeasible, while off-line methods are time-consuming and imperfect. However, unlabeled samples are readily available from on-line production. This paper examines pretraining on a large, unlabeled dataset and its impact on performance after fine-tuning on a smaller labeled dataset. We use three techniques: (1) contrastive learning, (2) an autoencoder, and (3) classification of coil ID. We address the challenges posed by the unique structure of the data, comprising 2-dimensional, long and thin arrays. Our results show that our classification pretraining approach improves regression performance and outperforms the baseline.

Authors: Alex Milne (Swansea University); Xianghua Xie (Swansea University)*; Gary Tam (Swansea University)


Paper 27 — CymruFluency – A fusion technique and a 4D Welsh dataset for Welsh fluency analysis

Abstract: Welsh is a linguistically rich yet under-resourced minority language. Despite its cultural significance, automated fluency assessment remains largely unexplored due to limited datasets and tools. Existing models focus on high-resource languages, leaving Welsh without sufficient multi-modal resources. To address this, we introduce CymruFluency, the first 4D dataset for Welsh fluency assessment, capturing both audio and 3D lip movements with expert-annotated fluency scores. Building on this, we propose a multi-modal fluency classification framework that combines audio features (mel spectrograms) and manually annotated 3D lip landmarks. Our fusion approach significantly improves fluency prediction over unimodal models, emphasizing the critical role of 3D lip dynamics in Welsh learning. This research advances minority language processing by integrating articulatory features into fluency evaluation, offering a powerful tool for Welsh language learning, assessment, and preservation. Project page: https://github.com/arvinsingh/CymruFluency

Authors: Arvin Bali (Swansea University); Gary Tam (Swansea University)*; Avishek Siris (Swansea University); Gareth Andrews (Swansea University); Yukun Lai (Cardiff University); Bernie Tiddeman (Aberystwyth University); Gwenno Ffrancon (Academi Hywel Teifi)


Paper 28 — Visualizing the Lifespan of Industrial Objects with AI-Generated Texture Space

Abstract: Real-world industrial assets exhibit diverse surface appearances due to multiple factors such as rust, wear, and dirt accumulation. However, simulation environments contain identical or clean object appearances, causing a significant gap between synthetic and real data. Moreover, manually designing unique and intricate realistic textures is inefficient. In this paper, we present a scalable texture synthesis pipeline to produce a multidimensional texture space featuring various industrial object lifespans. It consists of three stages: (1) state translation: to map textures to different states, (2) state interpolation: to generate smooth transitions across different statuses, and (3) space blending: to combine multiple factors affecting object appearances. We evaluate our pipeline with three common industrial assets: dollies, small load carrier boxes, and pallets. As a result, our pipeline generalizes to new shape variants, while preserving realistic and smooth transitions in extreme factors.

Authors: Joe Khalil (BMW Group)*; Chafic Abou Akar (BMW Group); Marc Barouky (BMW Group); Dani Azzam (BMW Group); Marc Kamradt (BMW Group); Raphaël Couturier (Univ. Marie et Louis Pasteur, FEMTO-ST Institute)


Paper 29 — SelCLR: Self-labeling with Contrastive Learning and Applications in Machine Vision Systems

Abstract: Self-supervised learning has become a potent paradigm for representation learning in image classification, especially when labeled data is limited. However, current methods often yield suboptimal feature representations and poor clustering, hindering their performance in downstream tasks. To address this, we propose a new self-supervised learning framework that enhances feature embeddings and clustering quality by integrating contrastive learning with cross-entropy-based self-labeling. Our approach employs a two-stage training process: initial pretraining of the backbone network using contrastive learning, followed by further optimization with our hybrid loss function. Extensive experiments on CIFAR, TTSD, and NEU datasets across various evaluation settings demonstrate the effectiveness of our method. Specifically, we achieve accuracies of 84.28% on CIFAR-10, 57.06% on CIFAR-100, 86.46% on TTSD, and 95.00% on NEU for machine vision applications. Beyond classification accuracy, our method also offers a pseudo-labeling mechanism applicable to unlabeled datasets.

Authors: La Nguyen Gia Hy (Ho Chi Minh University of Technology, VNU-HCM); Duong Duc Tin (Ho Chi Minh University of Technology, VNU-HCM); Le Hong Trang (Ho Chi Minh University of Technology, VNU-HCM)*


Paper 30 — KENDALL-ROFT: Kendall’s shape analysis with Rigid transformation and Optical Flow for Transformation-based micro-expression recognition

Abstract: Micro-expressions are brief muscle changes that occur when people hide true emotions, making them hard to detect. Most existing approaches are pixel-based, which may fall short when external conditions such as lighting obscure the true expressions. To this end, a novel framework for micro-expression recognition called KENDALL-ROFT is proposed. It integrates: (1) a manifold-based approach leveraging Kendall’s shape space to capture subtle non-rigid deformations from 3D facial landmark configurations, (2) optical flow to highlight fine-grained facial displacements, and (3) rigid transformation parameters for global alignment. By embedding 3D facial landmarks into a pre-shape space and representing their differences via tangent vectors, our method isolates shape-specific variations from large-scale head movements. Optical flow, particularly when magnified, further accentuates transient muscle activations characteristic of micro-expressions, while rigid transformation (rotation, scale, translation) ensures reliable separation of overall head motion. Comprehensive evaluations on the CASME2, CAS(ME)3, and SAMM datasets demonstrate consistent performance gains in both simple and complex recognition scenarios, reaching state-of-the-art performance. These results show the importance of integrating geometry-aware shape representations, fine-motion cues, and global alignment into a robust and unified architecture for micro-expression recognition. Code is available at https://github.com/ROFT1/KENDALLROFT.
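The pre-shape embedding and tangent-vector computation the abstract describes follow standard Kendall shape analysis and can be sketched generically. This is a minimal illustration under the usual definitions, not the authors' code; the function names and the orthogonal-Procrustes alignment are our assumptions.

```python
import numpy as np

def preshape(landmarks):
    """Project a (k, 3) landmark configuration onto Kendall's pre-shape
    sphere: remove translation (centering) and scale (unit centroid size)."""
    z = landmarks - landmarks.mean(axis=0)
    return z / np.linalg.norm(z)

def align(a, b):
    """Rotate pre-shape b to best match pre-shape a (orthogonal Procrustes)."""
    u, _, vt = np.linalg.svd(b.T @ a)
    return b @ (u @ vt)

def log_map(a, b):
    """Tangent vector at a pointing toward the rotation-aligned b:
    the log map on the pre-shape sphere."""
    b = align(a, b)
    cos_t = np.clip(np.sum(a * b), -1.0, 1.0)
    theta = np.arccos(cos_t)
    if theta < 1e-12:
        return np.zeros_like(a)
    return theta / np.sin(theta) * (b - cos_t * a)

# Demo: a rotated copy of the same configuration yields a near-zero
# tangent vector, i.e. rigid head rotation is factored out.
rng = np.random.default_rng(1)
x = rng.standard_normal((68, 3))          # e.g. 68 facial landmarks
c, s = np.cos(0.3), np.sin(0.3)
R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
p, q = preshape(x), preshape(x @ R)
v = log_map(p, q)
```

Because centering, scaling, and rotation alignment are removed before the tangent vector is taken, `v` captures only non-rigid shape change, which is the property the framework exploits to separate muscle deformation from head motion.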

Authors: Arafet Sbei (Université de Strasbourg)*; Hassen Drira (Université de Strasbourg); Faten Chaieb (Paris Panthéon-Assas University, EFREI Research Lab)


Paper 31 — Pointy – A Lightweight Transformer for Point Cloud Foundation Models

Abstract: Foundation models for point cloud data have recently grown in capability, often leveraging extensive representation learning from language or vision. In this work, we take a more controlled approach by introducing a lightweight transformer-based point cloud architecture. In contrast to the heavy reliance on cross-modal supervision, our model is trained only on 39k point clouds – yet it outperforms several larger foundation models trained on over 200k training samples. Interestingly, our method approaches state-of-the-art results from models that have seen over a million point clouds, images, and text samples, demonstrating the value of a carefully curated training setup and architecture. To ensure rigorous evaluation, we conduct a comprehensive replication study that standardizes the training regime and benchmarks across multiple point cloud architectures. This unified experimental framework isolates the impact of architectural choices, allowing for transparent comparisons and highlighting the benefits of our design and other tokenizer-free architectures. Our results show that simple backbones can deliver competitive results to more complex or data-rich strategies. The implementation, including code, pre-trained models, and training protocols, is available at https://github.com/KonradSzafer/Pointy.

Authors: Konrad Szafer (Poznan University of Technology); Marek Kraft (Poznan University of Technology); Dominik Belter (Politechnika Poznańska)*


Paper 34 — Legibility vs. Extractability: Crafting Visual Defenses Against Automated OCR

Abstract: The rise of generative AI, particularly large language models (LLMs), has redefined problem-solving across domains—but it has also introduced new challenges in academic integrity. Students are increasingly using AI-powered Optical Character Recognition (OCR) tools to extract restricted content from screenshots, bypassing traditional safeguards that prevent copying and pasting. In this study, we investigate a set of visual defense techniques designed to counter automated OCR systems: (1) adversarial fonts crafted to disrupt character recognition, (2) color-based distortions that alter visual contrast, (3) animated interference lines that obstruct character boundaries, and (4) a novel cloud blur effect that dynamically follows the cursor to obscure localized text regions. We evaluate these strategies across multiple LLM-integrated OCR platforms—ChatGPT, DeepSeek, Claude, and Gemini. Our findings show that modern OCR tools remain largely unaffected by custom fonts, line obstructions, and color distortions. In contrast, the cloud blur technique significantly reduces OCR accuracy while preserving legibility for human readers. These results highlight dynamic, context-aware visual obfuscation as a promising and potentially future-proof solution for deterring AI-assisted text extraction. Cloud blur, in particular, emerges as the most effective approach, offering strong resistance to OCR while maintaining accessibility for legitimate human users. A live demo is available at https://cv.aut.ac.nz/nFonts

Authors: Minh Nguyen (Auckland University of Technology); Kien Tran (Auckland University of Technology)*; Han Huynh (Auckland University of Technology)


Paper 35 — Analysis of Long-term Player Action Prediction Performance Based on Causal Modelling in Rugby League

Abstract: Predicting athletes’ behavior is important in designing game strategies and enhancing audience engagement in sports analytics. While machine learning has enabled advancements in action prediction across various sports, most existing models primarily capture correlations rather than underlying causal relationships. Here, we study long-term player action prediction in team sports (e.g., Rugby League) by integrating causal behavior modeling with Graph Neural Networks (GNNs). We assessed the model’s performance over extended video sequences, evaluating its resilience to varying observation window sizes. Our results demonstrate that incorporating causal structures significantly improves long-range action prediction accuracy, with the transformer-based graph neural network (TransformerConv) exhibiting superior robustness.

Authors: Jiaxuan Wang (The University of Auckland); Ruigeng Wang (IVSLab, School of Computer Science, The University of Auckland); Shahrokh Heidari (IVSLab, School of Computer Science, The University of Auckland); David Soriano Valdez (IVSLab, School of Computer Science, The University of Auckland); Mitchell Rogers (IVSLab, School of Computer Science, The University of Auckland); Gael Gendron (IVSLab, School of Computer Science, The University of Auckland); Yani He (IVSLab, School of Computer Science, The University of Auckland); Nicolas Mir (IVSLab, School of Computer Science, The University of Auckland); Yalu Zou (The University of Auckland); Riki Mitchell (Riki Consulting Ltd, Auckland); Alfonso Gastelum Strozzi (ICAT, UNAM, Mexico City); Marcel Noronha (The Warriors, Penrose); Michael Witbrock (NAOI, The University of Auckland); Patrice Delmas (IVSLab, School of Computer Science, The University of Auckland)*


Paper 36 — Satellite image segmentation for landcover mapping using atrous spatial pyramid pooling and lightweight attention mechanism

Abstract: Most work on landcover analysis using satellite imagery has employed single-date images, which are incapable of capturing seasonal variations in landcover classes. To address this, we propose Attention-ASPP-FCN, a 3D Fully Convolutional Network (FCN) segmentation architecture based on 3D Atrous Spatial Pyramid Pooling (ASPP) modules and self-attention that processes Satellite Image Time-Series (SITS) rather than single-date satellite images. This yields better landcover segmentation maps, as the model captures the temporal dynamics of landcover, including phenological variations, seasonal changes in vegetation, and land use patterns over time. To capture long-range dependencies among features, we used two main strategies: (i) expanding the receptive field through ASPP modules, which resample the input feature vector at multiple dilation rates to capture context at different scales and make the network more robust to translational variances, and (ii) incorporating a lightweight 3D timeframe-wise self-attention to assign more weight to important timeframes in a SITS. The proposed model offers a reliable and efficient method for precisely segmenting different landcover classes from a SITS by utilizing pertinent spectral, spatial, and temporal data, achieving the highest mean Intersection over Union (mIoU) of 93.26% and 94.13% in the Indian and Slovenian Regions of Interest (RoI), respectively. The proposed model improved the mIoU by a margin of 6.54-6.68% over the best single-date model, demonstrating the superiority of a SITS over single-date images for landcover segmentation.

Authors: Preetpal Buttar (Sant Longowal Institute of Engineering and Technology, Longowal)*


Paper 37 — Zero-Shot Seafloor Sediment Microtopography Characterization Using Stereo from a Drifting Monocular Camera

Abstract: High-resolution characterization of seafloor sediment microtopography is essential for understanding benthic habitat structure, sedimentary processes, and ecological function. However, existing methods typically rely on core-based sampling or specialized 3D imaging systems, both of which are limited by cost, complexity, and scalability. In this study, we present a cost-effective, camera-based framework for quantitative sediment surface analysis using video footage from a drifting monocular underwater camera. The method leverages a zero-shot application of RAFT-Stereo to estimate dense disparity maps from sequential frames without requiring prior training on sediment data. After normalizing disparities, we apply surface detrending and extract statistical and morphological roughness features. Through a small-scale case study, we evaluate the method on two distinct sediment types, Sand and Shell-Hash, and demonstrate that the extracted features effectively capture surface complexity. Additionally, we assess the pipeline’s consistency using overlapping samples acquired with different virtual stereo baselines, showing that key global features remain robust despite variations in camera motion. This framework offers a scalable, non-invasive solution for retrospective and in-situ sediment analysis in marine monitoring.

Authors: Shahrokh Heidari (University of Auckland)*; Mihailo Azhar (Aarhus University); Tegan Evans (University of Auckland); Yani He (University of Auckland); Stefano Schenone (University of Auckland); Patrice Delmas (University of Auckland); Simon Thrush (University of Auckland)


Paper 38 — Keypoint-Integrated Instruction-Following Data Generation for Enhanced Human Pose and Action Understanding in Multimodal Models

Abstract: Current vision-language multimodal models are well-adapted for general visual understanding tasks. However, they perform inadequately when handling complex visual tasks related to human poses and actions due to the lack of specialized vision-language instruction-following data. We introduce a method for generating such data by integrating human keypoints with traditional visual features such as captions and bounding boxes, enabling more precise understanding of human-centric scenes. Our approach constructs a dataset comprising 200,328 samples tailored to fine-tune models for human-centric tasks, focusing on three areas: conversation, detailed description, and complex reasoning. We establish a benchmark called Human Pose and Action Understanding Benchmark (HPAUB) to assess model performance on human pose and action understanding. We fine-tune the LLaVA-1.5-7B model using this dataset and evaluate it on the benchmark, achieving significant improvements. Experimental results show an overall improvement of 21.18% compared to the original LLaVA-1.5-7B model. These findings highlight the effectiveness of keypoint-integrated data in enhancing multimodal models. Code is available at https://github.com/Ody-trek/Keypoint-Instruction-Tuning.

Authors: Dewen Zhang (The University of Electro-Communications)*; Wangpeng An (TikTok); Hayaru Shouno (The University of Electro-Communications)


Paper 39 — Privacy-aware Human-Object Interaction in the wild – Novel dataset

Abstract: Monitoring people in the wild is important for safety: it enables faster emergency responses from surveillance cameras and supports autonomous vehicles and mobile robotic platforms. However, it also raises concerns around privacy and mass surveillance. To this end, thermal imagery is often applied, as it reduces privacy concerns and works well in a variety of weather and lighting conditions. Unfortunately, fewer thermal datasets are available compared to RGB datasets. In particular, there is a need for thermal datasets for human-object interaction (HOI) detection, the frontier of human scene understanding required to enable such safety and autonomy tasks. Therefore, to advance research on real-world HOI scenarios, we introduce the first thermal HOI dataset. The dataset uses surveillance video of a harbor (Nikolow et al. 2021) and captures the long tail of real-world behavior, presenting a challenge for machine learning systems.

Authors: Luna Lux Fredenslund (Aalborg University)*; Vasiliki Ismiroglou (Aalborg University); Thomas Moeslund (Aalborg University); Kamal Nasrollahi (Milestone Systems)


Paper 40 — Beyond Pixels: Leveraging the Language of Soccer to Improve Spatio-Temporal Action Detection in Broadcast Videos

Abstract: State-of-the-art spatio-temporal action detection (STAD) methods show promising results for extracting soccer events from broadcast videos. However, when operated in the high-recall, low-precision regime required for exhaustive event coverage in soccer analytics, their lack of contextual understanding becomes apparent: many false positives could be resolved by considering a broader sequence of actions and game-state information. In this work, we address this limitation by reasoning at the game level and improving STAD through the addition of a denoising sequence transduction task. Sequences of noisy, context-free player-centric predictions are processed alongside clean game state information using a Transformer-based encoder-decoder model. By modeling extended temporal context and reasoning jointly over team-level dynamics, our method leverages the “language of soccer” – its tactical regularities and inter-player dependencies – to generate “denoised” sequences of actions. This approach improves both precision and recall in low-confidence regimes, enabling more reliable event extraction from broadcast video and complementing existing pixel-based methods.

Authors: Jeremie OCHIN (Ecole des Mines de Paris)*; Raphael Chekroun (Footovision); Bogdan Stanciulescu (Ecole des Mines de Paris); Sotiris Manitsaris (Ecole des Mines de Paris)


Paper 42 — LOS Ground Displacement Monitoring in Northeast Tunisia Using SBAS InSAR

Abstract: The Lebna watershed in northeast Tunisia is marked by important agricultural activity and hydro-geological settings that enhance its susceptibility to various geohazards, including land subsidence and uplift. This study monitors Line of Sight (LOS) deformation using the Interferometric Synthetic Aperture Radar (InSAR) technique. The main observations are LOS velocity maps derived from the Small Baseline Subset (SBAS) InSAR method, calculated from ESA Sentinel-1 data for the period 2015-2023. The SAR data yielded velocities ranging between −1.05 and +0.80 cm/year. Both subsidence and uplift trends were observed in different areas, indicating that ground deformation in the study region could depend on geological and hydrogeological factors.

Authors: Imen Brini (IRD)*


Paper 43 — Improving Face Image Retrieval in Historical Archives: Fusion of Mirrored Images and Better Consensus Ranking

Abstract: This paper explores an effective method for retrieving additional images of a specific individual from large, unannotated photo collections using a reference query image. This task, known as face image retrieval (FIR), focuses on maximising recall by retrieving a greater number of relevant images and improving precision by increasing the proportion of correct matches. However, precision and recall inherently trade off against each other: an increase in one often leads to a decrease in the other. To enhance the retrieval process, two key improvements are introduced. First, the existing consensus ranking method is strengthened to perform more reliably and faster. Second, it is discovered that mirroring the query image and averaging the corresponding face feature vectors leads to an overall improvement in both precision and recall. The proposed method is evaluated using both a historical photo dataset derived from criminal identification photos and a publicly available face recognition dataset. Additionally, the efficiency of the CVLFace pipeline is examined, further assessing the overall effectiveness of the approach. These refinements contribute to a more robust approach for searching unannotated face photo databases, ensuring more effective retrieval performance.
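
The mirrored-query idea in this abstract can be sketched as follows. This is an illustrative reading, not the authors' code: `embed` stands in for whatever face feature extractor produces unit-length vectors, and `toy_embed` is a deliberately trivial placeholder.

```python
import numpy as np

def mirrored_query_embedding(image: np.ndarray, embed) -> np.ndarray:
    """Average the face embeddings of an image and its horizontal mirror.

    `embed` is a placeholder for any face feature extractor mapping an
    H x W x C image to a unit-length feature vector.
    """
    e_orig = embed(image)
    e_flip = embed(image[:, ::-1])        # horizontal mirror
    fused = (e_orig + e_flip) / 2.0
    return fused / np.linalg.norm(fused)  # re-normalize to unit length

# Toy extractor for illustration only: column profile of pixel intensities.
def toy_embed(img):
    v = img.mean(axis=(0, 2))
    return v / np.linalg.norm(v)
```

Averaging the original and mirrored embeddings damps pose asymmetries while keeping identity-bearing structure, which is one plausible reading of why the fusion can lift both precision and recall.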

Authors: Anders Hast (Uppsala University)*


Paper 44 — On-Device Continual Adaptation for Reliable Solar Irradiance Forecasting

Abstract: Solar energy is a vital renewable source in the ongoing energy transformation. However, the inherent uncertainty of solar-based energy sources prevents their stable, large-scale integration into power grids. Photovoltaic energy production, reliably reflected by solar irradiance, is highly dependent on temporal changes and spatially varying, difficult-to-predict short-term meteorological fluctuations, such as cloud movement and coverage. Moreover, with various climate zones and dynamically evolving atmospheric conditions, it is impossible to cover all possible weather phenomena with conventional forecasting approaches based on static and centralized algorithms. To mitigate this issue, we propose a continual adaptation strategy that adjusts to local conditions and atmospheric phenomena. The developed incremental learning method leverages the stream of incoming sensor readings and prior model predictions to determine the relevance of the data based on the absolute magnitude of the prediction error. The computed significance level serves as an indicator of whether, and which, model parameters should be revised. The proposed approach has been verified across different latitudes, seasons, and weather conditions, demonstrating its ability to adapt rapidly. The experiments have shown an improvement in solar irradiance forecasting: compared to the offline model, the mean absolute error has decreased by more than 60%, and the average mean absolute error has been reduced by 52 W/m² for a 30-minute forecast horizon. Moreover, the successful deployment and efficient pipeline processing on resource-constrained single-board computers showcase the strong applicability of the developed irradiance forecasting strategy.
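
The error-gated update rule described in the abstract can be illustrated with a deliberately simplified online learner. This is a sketch under assumptions (a linear model, an LMS-style update, and a hypothetical 50 W/m² gate), not the deployed forecaster:

```python
import numpy as np

class GatedOnlineForecaster:
    """Minimal sketch of significance-gated continual adaptation: the model
    revises its weights only when the absolute prediction error marks the
    incoming sample as relevant. Illustrative, not the paper's method."""

    def __init__(self, n_features, lr=0.01, threshold=50.0):
        self.w = np.zeros(n_features)
        self.lr = lr
        self.threshold = threshold  # hypothetical gate, in W/m^2

    def predict(self, x):
        return float(self.w @ x)

    def observe(self, x, y):
        err = y - self.predict(x)
        if abs(err) >= self.threshold:   # significance gate
            self.w += self.lr * err * x  # LMS-style parameter revision
        return err
```

The gate keeps routine, well-predicted readings from perturbing the model, so adaptation effort is spent on the surprising samples (e.g., sudden cloud cover) that carry new information.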

Authors: Mateusz Piechocki (Poznan University of Technology)*; Marek Kraft (Poznan University of Technology); Alessandro Capotondi (University of Modena and Reggio Emilia)


Paper 46 — RowFormer: Multiple Class-Token-based Vision Transformer for 2D Context-Aware Attention

Abstract: In this paper, we introduce RowFormer, a novel transformer architecture that treats an image as a collection of sentences rather than a single one. This approach addresses challenges related to modifying class tokens and projection mechanisms. RowFormer combines the strengths of convolutional neural networks and vision transformers by integrating spatial convolution, as seen in the Convolutional Vision Transformer and the Compact Convolutional Transformer. In this work, however, we introduce a novel transverse multi-class-token design for vision transformers, contrasting with the default single-token approach. This design is well suited to scenes as well as multi-object and single-object images. Our experiments demonstrate that RowFormer achieves higher efficiency across multiple classification benchmarks compared to existing models. This improvement stems from our hypothesis that images should not be interpreted as continuous, paragraph-like representations but as structured sequences of rows, each providing its own information. By leveraging this insight, RowFormer enhances visual data processing, enabling more effective image classification and understanding in complex tasks.

Authors: Mohamed Dahmane (Computer Research Institute of Montréal (CRIM))*; Mamadou-dian Bah (Computer Research Institute of Montréal (CRIM)); Rajjeshwar Ganguly (Computer Research Institute of Montréal (CRIM))


Paper 51 — DeeCLIP: A Robust and Generalizable Transformer-Based Framework for Detecting AI-Generated Images

Abstract: This paper introduces DeeCLIP, a novel framework for detecting AI-generated images using CLIP-ViT and fusion learning. Despite significant advancements in generative models capable of creating highly photorealistic images, existing detection methods often struggle to generalize across different models and are highly sensitive to minor perturbations. To address these challenges, DeeCLIP incorporates DeeFuser, a fusion module that combines high-level and low-level features, improving robustness against degradations such as compression and blurring. Additionally, we apply triplet loss to refine the embedding space, enhancing the model’s ability to distinguish between real and synthetic content. To further enable lightweight adaptation while preserving pre-trained knowledge, we adopt parameter-efficient fine-tuning using low-rank adaptation (LoRA) within the CLIP-ViT backbone. This approach supports effective zero-shot learning without sacrificing generalization. Trained exclusively on 4-class ProGAN data, DeeCLIP achieves an average accuracy of 89.00% on 19 test subsets composed of generative adversarial network (GAN) and diffusion models. Despite having fewer trainable parameters, DeeCLIP outperforms existing methods, demonstrating superior robustness against various generative models and real-world distortions. The code is publicly available at https://github.com/Mamadou-Keita/DeeCLIP for research purposes.

Authors: Mamadou Keita (Laboratory of IEMN, CNRS, Centrale Lille, UMR 8520, Univ. Polytechnique Hauts-de-France, Valenciennes, F-59313, France)*; Wassim Hamidouche (Khalifa University, Abu Dhabi, UAE); Hessen Bougueffa Eutamene (Laboratory of IEMN, CNRS, Centrale Lille, UMR 8520, Univ. Polytechnique Hauts-de-France, Valenciennes, F-59313, France); Abdelmalik Taleb-Ahmed (Laboratory of IEMN, CNRS, Centrale Lille, UMR 8520, Univ. Polytechnique Hauts-de-France, Valenciennes, F-59313, France); Abdenour Hadid (Sorbonne Center for Artificial Intelligence, Sorbonne University, Abu Dhabi, UAE)


Paper 55 — Oceans and algorithms: building successful collaborations in marine science and computer vision

Abstract: In recent years, the convergence of marine science and computer vision has opened new avenues for advancing ecological research through automation, large-scale data analysis, and novel insights into marine systems. From habitat mapping using remotely operated vehicles to species identification in vast image datasets, computer vision has become a powerful tool for addressing complex marine challenges. At the same time, these applications often push the boundaries of standard computer vision workflows, requiring adaptations to the unique conditions and questions inherent to marine environments. This paper emerges from years of interdisciplinary collaborations between marine and computer vision scientists who have worked together at the interface of these two domains. Rather than focusing on single studies or innovations, we take a step back to reflect on the collaborative processes that underpin successful projects. We believe that there is substantial value in documenting not only what was achieved, but how interdisciplinary teams worked together to achieve it. First, we share collective lessons learned – both positive and cautionary – from real-world projects that required sustained communication across disciplinary boundaries. Second, we propose a set of practical guidelines for others entering similar collaborations, grounded in both technical and interpersonal considerations. Finally, we present a blueprint for structuring and sustaining effective partnerships between marine ecologists and computer vision scientists, offering a process-focused perspective that complements existing methodological literature. By bringing together perspectives from both fields, this paper seeks to contribute to the growing discourse on interdisciplinary research practices in environmental informatics and applied AI.
We hope it will serve as a useful resource for researchers navigating similar partnerships, as well as for institutions and funders seeking to better support such work.

Authors: Stefano Schenone (The University of Auckland)*; Mihailo Azhar (Aarhus University); Jenny Hillman (The University of Auckland); Shahrokh Heidary (The University of Auckland); Tegan Evans (The University of Auckland); Patrice Delmas (The University of Auckland); Simon Thrush (The University of Auckland)


Paper 58 — SAViL-Det: Semantic-Aware Vision-Language Model for Multi-Script Text Detection

Abstract: Detecting text in natural scenes remains challenging, particularly for diverse scripts and arbitrarily shaped instances where visual cues alone are often insufficient. Existing methods do not fully leverage semantic context. This paper introduces SAViL-Det, a novel semantic-aware vision-language model that enhances multi-script text detection by effectively integrating textual prompts with visual features. SAViL-Det utilizes a pre-trained CLIP model combined with an Asymptotic Feature Pyramid Network (AFPN) for multi-scale visual feature fusion. The core of the proposed framework is a novel language-vision decoder that adaptively propagates fine-grained semantic information from text prompts to visual features via cross-modal attention. Furthermore, a text-to-pixel contrastive learning mechanism explicitly aligns textual and corresponding visual pixel features. Extensive experiments on challenging benchmarks demonstrate the effectiveness of the proposed approach, achieving state-of-the-art performance with F-scores of 84.8% on the benchmark multi-lingual MLT-2019 dataset and 90.2% on the curved-text CTW1500 dataset.

Authors: Mohammed-En-Nadhir Zighem (Sorbonne University Abu Dhabi); Abdenour Hadid (Sorbonne University Abu Dhabi)*


Paper 59 — Unmasking Performance Gaps: A Comparative Study of Human Anonymization and Its Effects on Video Anomaly Detection

Abstract: Advancements in deep learning have improved anomaly detection in surveillance videos, yet they raise urgent privacy concerns due to the collection of sensitive human data. In this paper, we present a comprehensive analysis of anomaly detection performance under four human anonymization techniques, including blurring, masking, encryption, and avatar replacement, applied to the UCF-Crime dataset. We evaluate four SOTA anomaly detection methods on the anonymized UCF-Crime to reveal how each method responds to different obfuscation techniques. Experimental results demonstrate that anomaly detection remains viable under anonymized data and is dependent on the algorithmic design and the learning strategy. For instance, under certain anonymization patterns, such as encryption and masking, some models inadvertently achieve higher AUC performance than on raw data, due to the strong responsiveness of their algorithmic components to these noise patterns. These results highlight the algorithm-specific sensitivities to anonymization and emphasize the trade-off between preserving privacy and maintaining detection utility. Furthermore, we compare these conventional anonymization techniques with emerging privacy-by-design solutions, highlighting an often overlooked trade-off between robust privacy protection and utility flexibility. Through comprehensive experiments and analyses, this study provides a compelling benchmark and insights into balancing human privacy with the demands of anomaly detection.

Authors: Sara Abdulaziz (Eindhoven University of Technology)*; Egor Bondarev (Eindhoven University of Technology)


Paper 60 — Application of Conditional Neural Movement Primitive (CNMP) for Movement Decoding Using Brain Signals

Abstract: Current applications of Brain-Machine Interfaces (BMIs) primarily focus on decoding user intentions directed toward the final target, often neglecting the sequential actions required to reach that target. This limitation is largely due to the susceptibility of electroencephalogram (EEG) signals—commonly used in BMI systems—to noise. As a result, recent research has shifted its focus toward motor imagery tasks, often discarding motor execution-based BMI systems. In this work, we address the challenge of decoding detailed motion trajectories from brain signals. Rather than relying solely on noisy EEG signals, we integrate current movement information along with temporal features to incorporate time constraints into the learning process. Experimental results demonstrate that the proposed model can accurately predict the next target location from brain signals, producing trajectories that closely resemble the ground truth.

Authors: Genci Capi (Hosei University)*; Goragod Pongthanisorn (Hosei University); Genti Progri (A. Xhuvani University); Shin-ichiro Kaneko (National Institute of Technology, Toyama College)


Paper 61 — Unsupervised Multi-Class Glioma Segmentation in 3D MRI Using Adaptive Thresholding and Hierarchical Clustering

Abstract: Accurate glioma segmentation in 3D MRI scans is crucial for diagnosis and treatment planning. However, supervised deep learning models are limited by the need for expert annotations. In this study, we introduce an unsupervised two-step approach for multi-class glioma segmentation, eliminating the dependency on labeled data. First, a robust global tumor mask is extracted using adaptive (Sauvola) thresholding to dynamically adjust to intensity variations, followed by morphological processing for noise removal and refinement. A strict 3D fusion across axial, coronal, and sagittal planes ensures spatial consistency, reducing segmentation artifacts. Second, an optimized hierarchical unsupervised clustering method (HUP-OAP) is applied to classify tumor regions into active tumor, edema, and necrosis based on multi-modal MRI intensity distributions. This method leverages affinity-propagation-based clustering, ensuring anatomical coherence and precise tumor subregion segmentation. Our method is evaluated on the BraTS 2021 dataset, achieving an average Dice score (DSC) of 13% and a Hausdorff Distance (HD) of 3.58, demonstrating competitive performance compared to supervised deep learning methods without requiring manual annotations. The improved segmentation accuracy and reduced boundary errors have direct implications for clinical applications, including more precise surgical planning (minimizing damage to healthy tissue), enhanced radiotherapy targeting (ensuring optimal dose delivery), and improved treatment monitoring (better tumor progression tracking). Our results suggest that this approach offers a scalable, interpretable, and annotation-free alternative for automated glioma segmentation, making it highly suitable for clinical applications.

Authors: Jihan Alameddine (Postdoctoral Researcher)*; Céline Thomarat (University Hospital Center of Poitiers, Department of Imaging); Rémy Guillevin (University Hospital Center of Poitiers, Department of Imaging, France / Labcom I3M, Poitiers); Christine Fernandez-Maloigne (Laboratory XLIM, CNRS UMR 7252 / Labcom I3M, Poitiers); Carole Guillevin (University Hospital Center of Poitiers, Department of Imaging, France / Labcom I3M, Poitiers)


Paper 62 — Secure Image Transmission in IoT Network Using Chaotic / Neural Network-Predictive S-Boxes for ASCON Lightweight Cryptography

Abstract: Securing image data in resource-constrained Internet of Things (IoT) environments necessitates lightweight cryptographic solutions. This paper proposes a novel image encryption model that integrates a neural network (NN) for dynamic Substitution Box (S-box) generation, the ASCON lightweight authenticated encryption algorithm, and a chaotic logistic map for deriving essential parameters. The NN, trained with a custom loss function, generates a 16×16 S-box based on the initial seed and control parameter of the logistic map. These same chaotic parameters also generate the 128-bit encryption key for the ASCON algorithm, which utilizes the NN-generated S-box in its substitution layer for image encryption. Experimental results demonstrate that the trained NN produces S-boxes with consistent cryptographic properties across various chaotic parameter sets. The image encryption scheme exhibits strong security metrics, including high sensitivity to plaintext changes (NPCR ≈ 99.6%), significant diffusion (UACI ≈ 33.6%), near-ideal entropy (≈ 8 bits), and low correlation between adjacent pixels. The encryption and decryption speeds are also reported. While the initial implementation of the S-box generation faced challenges in ensuring bijectivity, the overall framework presents a promising approach for lightweight image encryption in IoT environments by leveraging the dynamic nature of NN-generated S-boxes driven by chaotic systems within the efficient ASCON algorithm.
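
The chaotic parameter-derivation step can be sketched independently of the neural S-box. This is a minimal illustration, assuming a logistic map in its chaotic regime (r close to 4) and a simple byte quantization that the abstract does not specify; the NN-based S-box generation is omitted:

```python
def logistic_keystream(x0: float, r: float, n_bytes: int, burn_in: int = 100) -> bytes:
    """Derive n_bytes pseudo-random bytes from logistic-map iterations
    x_{k+1} = r * x_k * (1 - x_k). Illustrative only: the paper couples
    such a stream with an NN-generated S-box and the ASCON cipher; this
    shows just the chaotic parameters-to-bytes step."""
    x = x0
    for _ in range(burn_in):            # discard the transient
        x = r * x * (1 - x)
    out = bytearray()
    for _ in range(n_bytes):
        x = r * x * (1 - x)
        out.append(int(x * 256) % 256)  # quantize the state to one byte
    return bytes(out)

# A 128-bit key from a hypothetical (x0, r) seed pair.
key = logistic_keystream(x0=0.6180339887, r=3.99, n_bytes=16)
```

Because the same (x0, r) pair can seed both the key and the S-box generator, sensitivity to initial conditions propagates through the whole scheme; note that floating-point chaotic maps are platform-sensitive, so practical designs typically use fixed-point arithmetic.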

Authors: Zaydon Ali (El Manar University)*; Walid Barhoumi (Université de Tunis El Manar); Houcemeddine Hermassi (Université de Tunis El Manar)


Paper 63 — SV-GaSRelight: Single-View Gaussian Splatting for 3D Human Relighting

Abstract: In virtual environments, where dynamic and interactive 3D content is essential, realistic lighting plays a crucial role in enhancing immersion. However, traditional relighting methods often rely on multiview captures or computationally expensive neural rendering techniques, making them impractical for real-time applications. In this work, we propose SV-GaSRelight, a novel single-view 3D scene relighting approach specifically designed for human relighting. Our main contribution is to exploit human bodies in the scene as 3D probes that help decompose lighting conditions from surface albedo and orientations. This is possible thanks to very powerful models that are able to reconstruct 3D human bodies from a single view. Our approach reconstructs a 3D human model from a single image and generates synthetic multi-view data with corresponding camera poses. We then simulate realistic lighting effects by incorporating physically inspired reflectance modeling and light transport into the 3D representation. To introduce lighting information into this representation, we leverage a neural-based light field mechanism. We assess the effectiveness of our approach by comparing it to an existing method that requires multi-view inputs for relighting. Additionally, we conduct a user study to evaluate the perceived realism of the rendered scenes. Experimental results demonstrate that our single-view-based method achieves comparable visual quality to multi-view relighting techniques. Our project is available at https://sj-programming.github.io/SV-GaSRelight/.

Authors: Sonain Jamil (University Jean Monnet (UJM), Saint-Etienne)*; Damien Muselet (University Jean Monnet (UJM), Saint-Etienne); Alain Tremeau (University Jean Monnet (UJM), Saint-Etienne); Philippe Colantoni (University Jean Monnet (UJM), Saint-Etienne)


Paper 64 — Dynamic Chaotic-ASCON Encryption: ECG Security in Resource-Constrained IoT

Abstract: The protection of electrocardiogram (ECG) signals is a critical requirement in Internet of Things (IoT)-enabled healthcare systems, where resource constraints necessitate secure yet lightweight cryptography. This paper proposes a new encryption framework that elegantly incorporates chaotic dynamics with the ASCON-1 authenticated encryption algorithm to secure ECG transmissions. The essential innovation is the dynamic derivation of encryption keys using chaotic permutations created by logistic map sequences, resulting in increased unpredictability and resistance to cryptanalytic attacks. Prior to encryption, ECG signals are normalized and quantized, then securely transformed using ASCON-1, which uses associated data to ensure integrity and confidentiality. The scheme's efficacy is confirmed by experimental results, which show great encryption quality, as indicated by high peak signal-to-noise ratio (PSNR), low bit error rate (BER), and excellent signal correlation, while requiring minimal computational and memory cost. The suggested method, designed for restricted IoT contexts, provides a secure, efficient, and scalable alternative for protecting biomedical data in modern healthcare infrastructures.

Authors: Zaydon L. Ali (El Manar University)*; Walid Barhoumi (SIIVA, LIMTIC Lab, ISI, University of Tunis El Manar, Tunisia; ENICarthage, University of Carthage, Tunisia); Houcemeddine Hermassi (Ecole Nationale d'Ingénieurs de Tunis, Université de Tunis El Manar, Tunis, Tunisia)


Paper 65 — Dance Style Recognition Using Laban Movement Analysis

Abstract: The growing interest in automated movement analysis has presented new challenges in the recognition of complex human activities, including dance. This study focuses on dance style recognition using features extracted with Laban Movement Analysis (LMA). Previous studies of dance style recognition often focus on cross-frame movement analysis, which limits the ability to capture temporal context and dynamic transitions between movements. This gap highlights the need for a method that can add temporal context to LMA features. To this end, we introduce a novel pipeline that combines 3D pose estimation, 3D human mesh reconstruction, and floor-aware body modeling to effectively extract LMA features. To address the temporal limitation, we propose a sliding window approach that captures movement evolution across time in the features. These features are then used to train various machine learning methods for classification, and explainable AI methods are used to evaluate the contribution of each feature to classification performance. Our proposed method achieves a classification accuracy of up to 99.18%, which shows that the addition of temporal context significantly improves dance style recognition performance.
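
The sliding-window aggregation can be sketched as follows. Window and stride sizes are illustrative assumptions, and mean/standard deviation are stand-ins for whatever per-window statistics the pipeline actually computes over LMA features:

```python
import numpy as np

def sliding_window_features(frames: np.ndarray, win: int = 30, stride: int = 10) -> np.ndarray:
    """Aggregate per-frame feature vectors over sliding windows.

    frames: (T, F) array of per-frame LMA-style features. Each window is
    summarized by its mean and standard deviation, adding the temporal
    context that a single frame lacks. Returns (num_windows, 2 * F).
    """
    windows = []
    for start in range(0, len(frames) - win + 1, stride):
        chunk = frames[start:start + win]
        windows.append(np.concatenate([chunk.mean(axis=0), chunk.std(axis=0)]))
    return np.stack(windows)
```

Overlapping windows (stride smaller than the window length) let adjacent feature vectors share context, so transitions between movements are represented rather than cut at window boundaries.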

Authors: Muhammad Turab (Laboratoire Hubert Curien – UMR 5516, Saint-Etienne)*; Philippe Colantoni (Laboratoire Hubert Curien – UMR 5516, Saint-Etienne); Damien Muselet (Laboratoire Hubert Curien – UMR 5516, Saint-Etienne); Alain Trémeau (Laboratoire Hubert Curien – UMR 5516, Saint-Etienne)


Paper 69 — Context-Aware Vision Language Model for Action Recognition

Abstract: Access to life-saving humanitarian medicine in austere and resource-constrained environments is crucial to ensuring the health and safety of many around the world. However, access to high-quality medical care in these conditions is challenging due to the absence of connectivity and power, limited equipment, and the lack of available medical expertise. To address these constraints, we propose a memory-augmented Vision-Language Model (VLM) for action recognition and action anticipation to assist in medical decision-making and provide real-time intraoperative guidance to first responders and medical professionals performing life-saving interventions. To enhance the accuracy of predictions on the action recognition and action anticipation tasks, our proposed architecture consists of two modules: the vision module and the language module. The vision module is based on the MViTv2 transformer, and its outputs are fed into the language module. Because medical actions are often consecutive, with one action consistently performed after another, we augment the language module with memory of previous actions by concatenating all past actions recognized or anticipated by the vision module. This string is then passed into the language module, based upon the FlanT5 Large Language Model (LLM) architecture, which makes the final prediction on the recognized or anticipated action. The open-source Trauma THOMPSON dataset, which consists of unscripted egocentric videos of emergency trauma surgery procedures, is employed for model training and validation. Our proposed architecture achieves a top-1 accuracy of 69.11% on action recognition and 62.13% on action anticipation, outperforming the results reported by previous literature.
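
The action-memory idea (concatenating previously recognized actions into the language module's input) can be sketched as a simple prompt builder. The wording, delimiter, and `max_memory` cap here are illustrative assumptions, not the authors' prompt format:

```python
def build_memory_prompt(past_actions: list[str], max_memory: int = 8) -> str:
    """Concatenate previously recognized/anticipated actions into a single
    string that conditions the language module's next prediction."""
    recent = past_actions[-max_memory:]  # keep only the most recent actions
    memory = "; ".join(recent) if recent else "none"
    return f"Previous actions: {memory}. Predict the next action."
```

Capping the memory keeps the prompt within the language model's context budget while preserving the most recent, and usually most predictive, procedural steps.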

Authors: Eddie Zhang (The Harker School)*; Yupeng Zhuo (Purdue University); Juan Wachs (Purdue University)


Paper 70 — Deep Isoline Attack for Imperceptible Adversarial Perturbation on Face Recognition Systems

Abstract: Facial recognition systems are widely used in critical areas such as authentication and access control, but they remain vulnerable to adversarial attacks that can severely compromise their reliability. We introduce a novel hybrid attack framework that combines both geometric and intensity-based perturbations to effectively fool facial recognition models. Our approach begins with isoline detection to identify facial contours critical for identity discrimination. These contours are strategically perturbed using geometric transformations to disrupt structural features, while localized intensity-based adversarial masks are applied to high-saliency regions (e.g., eyes, mouth) to maximize feature confusion. Affecting only 10% of facial pixels, our method achieves an optimal balance between attack potency and imperceptibility. Evaluations on two benchmark datasets demonstrate that the proposed framework outperforms state-of-the-art adversarial techniques under black-box settings. These findings highlight critical vulnerabilities in current systems and underscore the urgent need for robust defense mechanisms against multifaceted adversarial strategies.

Authors: Emna BenSaid (RegimLab)*; Jabberi Marwa (RegimLab); Mohamed Neji (RegimLab); Adel Alimi (RegimLab)


Paper 71 — Automatic and Interactive Annotation of Non-Manual and Spatial Features in Pidgin Sign Japanese for SLR

Abstract: Pidgin Sign Japanese (PSJ), an intermediate form between Japanese Sign Language (JSL) and Manually Coded Japanese (MCJ), presents unique challenges for annotation due to its reliance on non-manual elements such as head movements and facial expressions. Building upon our previous work on automating the annotation of these elements, we present an improved annotation tool that enhances both detection accuracy and annotation efficiency. The tool significantly reduces manual effort by leveraging state-of-the-art methods for tracking human pose, hand, and face landmarks, along with recognizing facial action units (FAUs). Furthermore, it introduces interactive refinement capabilities to address common issues such as missing or inaccurate hand keypoints, including copy-pasting across frames and interpolation of hand positions. Evaluations on a preliminary PSJ dataset demonstrate a reduction in annotation time and improved usability compared to previous tools, supporting the creation of high-quality PSJ corpora for future Sign Language Recognition (SLR) research.
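
The interactive refinement capability for missing hand keypoints can be illustrated with plain linear interpolation between two annotated frames; the list-of-(x, y)-tuples layout is a hypothetical data format, not the tool's actual one:

```python
def interpolate_keypoints(kp_a, kp_b, n_missing):
    """Linearly interpolate 2D hand keypoints across n_missing dropped
    frames between annotated frames kp_a and kp_b ((x, y) tuples)."""
    frames = []
    for i in range(1, n_missing + 1):
        t = i / (n_missing + 1)  # fractional position of the missing frame
        frames.append([((1 - t) * xa + t * xb, (1 - t) * ya + t * yb)
                       for (xa, ya), (xb, yb) in zip(kp_a, kp_b)])
    return frames

# Two missing frames between a wrist at (0, 0) and at (3, 3):
filled = interpolate_keypoints([(0.0, 0.0)], [(3.0, 3.0)], 2)
```

Copy-pasting across frames, the other refinement mentioned above, is the degenerate case of reusing kp_a unchanged.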

Authors: Gibran Benitez-Garcia (The University of Electro-Communications)*; Nobuko Kato (Tsukuba University of Technology); Yuhki Shiraishi (Tsukuba University of Technology); Hiroki Takahashi (The University of Electro-Communications)


Paper 72 — RGC-TinyUNet++: Dual-Stage Segmentation for Accurate Early Detection of Mammary Microcalcifications

Abstract: Microcalcifications are considered the first radiographic indicators of breast cancer and remain significant biomarkers for diagnosis and prognosis. Thus, their reliable detection in mammograms is a key challenge in computer-aided breast imaging. Despite recent advances in deep learning, accurate segmentation of microcalcifications remains challenging, as deep models often miss these fine, low-contrast structures in heterogeneous breast tissue. This study proposes RGC-TinyUNet++, a dual-stage approach combining preprocessing with gamma correction and CLAHE enhancement, followed by ROI extraction focused on 32×32 extremity-centered patches. Initial segmentation is achieved via constraint-guided region growing (RGC), providing a preliminary mask, which is then refined through a lightweight Tiny-UNet model. Evaluated on a mammographic dataset, results show that our dual-stage strategy not only improves sensitivity to small-scale features but also maintains high precision in segmenting complex anatomical backgrounds. The proposed approach offers strong potential to enhance CAD systems for earlier and more accurate breast cancer diagnosis.
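
The first-stage segmentation by region growing can be illustrated with a simplified intensity-tolerance variant; the actual constraints used by RGC are not reproduced here, and the seed and tolerance values are assumptions:

```python
from collections import deque

import numpy as np

def region_grow(img, seed, tol=10):
    """Grow a region from `seed`, accepting 4-neighbours whose intensity
    lies within `tol` of the seed value (a simplified growing constraint)."""
    h, w = img.shape
    seed_val = int(img[seed])
    mask = np.zeros((h, w), dtype=bool)
    mask[seed] = True
    q = deque([seed])
    while q:
        y, x = q.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if (0 <= ny < h and 0 <= nx < w and not mask[ny, nx]
                    and abs(int(img[ny, nx]) - seed_val) <= tol):
                mask[ny, nx] = True
                q.append((ny, nx))
    return mask
```

In the paper's pipeline, the mask produced at this stage is the preliminary segmentation that Tiny-UNet then refines.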

Authors: Ikram BEN AHMED (ReGimLab)*


Paper 74 — Detecting StyleGAN-Generated Deepfake Faces with Vision Transformers and Latent Attention

Abstract: The rapid advancement of Generative Adversarial Networks (GANs), particularly StyleGAN, has led to an unprecedented increase in highly realistic synthetic images. While this technology opens up exciting opportunities across various fields, it poses significant challenges to digital security and content authenticity. To address this issue, our study focuses on developing a robust method for detecting facial images generated by StyleGAN. We propose an optimized Vision Transformer (ViT) model that leverages transfer learning and incorporates a latent attention module. This approach enhances the model’s detection capabilities, effectively identifying StyleGAN-generated images. A comprehensive evaluation, which includes large datasets of real and generated images, demonstrates the model’s remarkable performance. The proposed model achieves an accuracy of 99.83%, an AUC of 1.0, and an F1-score of 0.9983. Furthermore, the model exhibits strong generalization abilities on external datasets, confirming its efficacy in various deepfake detection scenarios. Moreover, thanks to the latent attention integration, the computational cost can be reduced by 42%, with up to an 85% reduction on one dataset. We extensively validated our approach on three diverse StyleGAN-generated deepfake datasets and compared its performance to six baseline methods, demonstrating its superiority in detecting StyleGAN-generated deepfakes and its contribution to digital content authentication and synthetic image detection.

Authors: Sokhna Bally Touré (UQAC); Haïfa Nakouri (ISG Tunis)*; Bob-Antoine Jerry Ménélas (UQAC)


Paper 75 — UNETRSal: Saliency Prediction with Hybrid Transformer-Based Architecture

Abstract: Saliency prediction plays a critical role in understanding visual attention, as it is a cornerstone for both natural scene understanding and automated document analysis. In this work, we propose the UNETRSal model for saliency prediction. Based on the UNETR transformer-based model, we introduce a new decoder to increase efficiency on 2D images. Comprehensive evaluations on benchmark datasets, such as SALICON and CAT2000, demonstrate that UNETRSal achieves state-of-the-art performance across multiple saliency metrics, surpassing both conventional CNN-based and transformer-based methods. These results not only underscore the strengths of hybrid transformer architectures in modeling visual attention but also highlight the potential impact on advancing document representation modeling and layout analysis.

Authors: Azamat Kaibaldiyev (University of Caen)*; Jérémie Pantin (University of Caen); Alexis Lechervy (University of Caen); Fabrice Maurel (University of Caen); Youssef Chahir (University of Caen); Gaël Dias (University of Caen)


Paper 77 — Beyond Face Blurring: Privacy-Preserving Surveillance via Homomorphic Encryption and Encrypted Facial Representations

Abstract: Public surveillance systems are necessary for security yet pose significant privacy risks due to unauthorized facial recognition. Traditional privacy-preserving methods, such as face blurring, are not practical in preventing identity leakage. This paper presents a novel framework that enhances privacy preservation in surveillance through the integration of deep learning-based facial feature extraction, adversarial encryption, and CKKS homomorphic encryption. The approach first detects faces using YOLOv8, followed by feature extraction with ResNet-18, which produces a 512-dimensional facial embedding. This work conceptually integrates CKKS homomorphic encryption to illustrate where secure facial feature processing could be applied in future systems. In the current prototype, actual encryption is not performed; instead, unencrypted 512-dimensional facial embeddings are visually projected onto the face regions to simulate anonymization. This provides privacy protection without exposing raw pixel data or identity-related visual information. Experimental results on real-world datasets validate the robustness of the encrypted representation against facial recognition models while maintaining video integrity. The proposed framework strikes a balance between security and privacy, thus providing a real-time privacy-preserving solution for surveillance applications. Optimizing computational efficiency and extending the approach to multimodal biometric systems are among the future research directions.
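
The visual projection of embeddings onto face regions could look like the sketch below, which min-max normalises a 512-d vector and tiles it over the detected box as 8-bit intensities. The function name and exact projection scheme are assumptions, not the paper's prototype:

```python
import numpy as np

def project_embedding(frame, box, emb):
    """Overwrite the face box (x0, y0, x1, y1) in a grayscale frame with a
    visual projection of the embedding vector (anonymisation stand-in)."""
    x0, y0, x1, y1 = box
    e = (emb - emb.min()) / (emb.max() - emb.min() + 1e-8)  # min-max normalise
    # np.resize repeats the vector cyclically to fill the box.
    tiled = np.resize((e * 255).astype(np.uint8), (y1 - y0, x1 - x0))
    frame[y0:y1, x0:x1] = tiled
    return frame
```

Since only the embedding values are written back, the face region no longer carries raw pixel or identity-related visual information.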

Authors: Sajid Ahmed (Saitama University, Japan)*; Yoshiura Noriaki (Saitama University, Japan)


Paper 79 — Advancing Cybersecurity with Liquid Neural Networks: Robustness and Efficiency in IDS

Abstract: Applications in cybersecurity are increasingly leveraging machine learning and deep learning models, creating a need for solutions that balance accuracy, robustness, and computational efficiency. In this study, we evaluate and compare three prominent sequential models: Recurrent Neural Networks (RNN), Long Short-Term Memory (LSTM) networks, and Liquid Neural Networks (LiquidNet), based on static training performance, dynamic testing resilience, and computational efficiency using the benchmark datasets NSL-KDD and KDD-CUP99. During static training with the NSL-KDD dataset, LiquidNet achieved an accuracy of 98.77%, slightly surpassing RNN (98.71%) and approaching LSTM’s performance (99.21%). On the KDD-CUP99 dataset, LiquidNet again outperformed both RNN and LSTM, achieving 99.60%, compared to 99.38% and 99.36%, respectively. Furthermore, in dynamic testing, LiquidNet demonstrated superior performance, achieving dynamic accuracies of 95.7% on NSL-KDD and 98.6% on KDD-CUP99, surpassing RNN (84% and 92%) and LSTM (78% and 90.8%). LiquidNet also outperforms RNN and LSTM in terms of computational efficiency, with near-zero inference times. In terms of noise robustness, LiquidNet demonstrated superior performance, exhibiting greater tolerance to high noise levels, which resulted in higher accuracy and F1 scores under various noise conditions. This research offers a pathway for exploring the trade-offs between accuracy, robustness, and efficiency in sequential networks, highlighting LiquidNet as a strong candidate for deployment in time-critical applications and environments prone to noise interference.

Authors: Noor Saud (Tunis El-Manar)*; Kamel Karoui (Carthage University)


Paper 80 — Extended Reality-Driven Testbed for Innovative Remote Drone Operations in Disaster Scenarios

Abstract: The teleoperation of unmanned aerial vehicles (UAVs) has traditionally been performed using joystick controllers and relying on the direct perception of the operator. However, the evolution of extended reality (XR) in recent years has allowed for more intuitive control methods while providing various features and additional information about the environment to improve the operator’s decision-making abilities. Although previous works have pointed out the advantages of using XR technology for UAV control, extensive further experimentation is required to objectively compare these new methods. This study aims to fill this research gap by presenting an experimental framework with tracks mimicking different disaster scenarios. The successful completion of a track involves both the placement of virtual obstacles related to the test scenario and the traversal of a challenging path. The practical utility of the framework was validated by comparing two essentially different XR-based drone control methods on various tracks using the built-in performance evaluation metrics.

Authors: Barnabás Bugár-Mészáros (SZTAKI)*; Géza Szeghy (SZTAKI); András Majdik (SZTAKI)


Paper 81 — SABSE: Segmentation-Assisted Baseline Shapley Values

Abstract: We introduce Segmentation-Assisted Baseline Shapley Values (SABSE), a novel model-agnostic framework for generating explanations of black-box object detection models. Unlike conventional methods that rely on generic superpixels, our approach leverages semantically meaningful segments generated by the Segment Anything Model (SAM) to more accurately capture critical image regions. By integrating SHAP-based attribution with advanced segmentation, SABSE produces explanations that better align with human perception and enhance the interpretability of object detection models' decisions. Additionally, we explore various segment replacement strategies to assess the impact of feature removal on detection performance, quantified by an adapted insertion score for object detection models. Experimental evaluations across multiple object categories demonstrate that SABSE not only improves explanation fidelity but also provides deeper insights into the decision-making process of detection models. These findings open promising avenues for enhancing transparency and trust in real-world AI applications.
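
Shapley attribution over image segments can be illustrated with the exact computation below. The toy `score` function stands in for a detector's confidence when only a coalition of segments is visible; it is additive, so each segment's Shapley value equals its standalone contribution:

```python
from itertools import combinations
from math import factorial

def shapley_values(n_segments, score):
    """Exact Shapley values over n_segments players, where score(coalition)
    is the detector's confidence with only those segments visible."""
    phi = [0.0] * n_segments
    for i in range(n_segments):
        others = [p for p in range(n_segments) if p != i]
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                # Standard Shapley weight |S|! (n - |S| - 1)! / n!
                w = (factorial(len(S)) * factorial(n_segments - len(S) - 1)
                     / factorial(n_segments))
                phi[i] += w * (score(frozenset(S) | {i}) - score(frozenset(S)))
    return phi

# Toy detector: segment 0 contributes 0.5 confidence, segment 1 contributes 0.2.
score = lambda S: 0.5 * (0 in S) + 0.2 * (1 in S)
vals = shapley_values(3, score)  # approximately [0.5, 0.2, 0.0]
```

In SABSE the coalitions would correspond to SAM segments kept or replaced by a baseline, and the exponential enumeration would be replaced by sampling for realistic segment counts.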

Authors: Marcel Henkel (Fraunhofer IOSB)*; Simon Keilbach (Fraunhofer IOSB); Nadia Burkart (Fraunhofer IOSB); Arne Schumann (Fraunhofer IOSB)


Paper 85 — Mammography Lexicon-based Explainable Artificial Intelligence for Diagnosis and Visual Interpretation of Breast Cancer

Abstract: The integration of Artificial Intelligence (AI) technologies has revolutionized various aspects of the healthcare field. As transparency and accountability in this field are of the utmost importance, eXplainable AI (XAI) has attracted increasing attention. Within the context of diagnostic interpretation, XAI can be exploited to reveal the reasoning behind an AI-based system’s decision. In this study, we prioritize explaining model predictions to medical experts for better interpretability and focus on advancing breast cancer diagnosis through interpretable deep learning models and rigorous analysis of results. The study employs various relevant deep learning and transfer learning models, while investigating an expert mammography lexicon for explainability on the challenging MIAS dataset. Moreover, extensive experiments have been performed in order to assess the influence of several criteria such as dataset balancing, the chosen deep learning model, and the selected XAI technique. Reported results show that Local Interpretable Model-agnostic Explanations (LIME) provides more satisfactory explainability than the Shapley Additive Explanations (SHAP) technique, after balancing the dataset and selecting the most appropriate deep learning model. Additionally, it should be noted that emphasizing the importance of the region of interest is crucial to avoid incorporating the background within mammograms as a predictive feature.

Authors: Ons Loukil (LIMTIC)*; Abir Baâzaoui (LIMTIC, Université de Kairouan); Walid Barhoumi (LIMTIC, Université de Carthage)


Paper 89 — Weakly Supervised Blue-Carbon Mapping of Rāhui Reefs with SAM-Bootstrapped nnU-Net

Abstract: Salt marshes, mangroves, seagrass, and coral reefs play a role in the sequestration of atmospheric CO2, leading to the concept of “blue carbon”. Routinely assessing such blue-carbon ecosystems requires labour-intensive diver surveys. To alleviate the analysis of the video footage recorded by our team of divers, we introduce an automated weakly supervised computer vision pipeline that turns commodity 1080p underwater video into per-species distribution maps and coarse seafloor CO2 indices. First, the method prompts Segment Anything in High Quality (HQ-SAM) with a handful of user clicks to generate pseudo-masks for ten ecologically indicative benthic taxa. These masks bootstrap four class-specific nnU-Net models that refine species boundaries. Finally, the carbon content, expressed as low/medium/high concentration indicators, is estimated in each frame, producing site-level heatmaps suitable for temporal comparison. Experiments were run on 412 manually curated frames from coastal sites in Northland, Aotearoa New Zealand. The final image segmentation results showed mean Intersection-over-Union (mIoU) and Dice coefficients of 0.55 and 0.83, respectively.
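
The reported IoU and Dice scores follow the standard boolean-mask definitions, which can be computed as below (this mirrors the usual formulas, not the authors' exact evaluation code):

```python
import numpy as np

def iou_and_dice(pred, gt):
    """Intersection-over-Union and Dice coefficient for boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    total = pred.sum() + gt.sum()
    iou = inter / union if union else 1.0      # empty masks count as perfect
    dice = 2.0 * inter / total if total else 1.0
    return float(iou), float(dice)

pred = np.array([[1, 1], [0, 0]], dtype=bool)
gt = np.array([[1, 0], [0, 0]], dtype=bool)
iou, dice = iou_and_dice(pred, gt)  # iou = 0.5, dice ≈ 0.667
```

Dice is always at least as large as IoU for the same masks, which is consistent with the 0.55 mIoU versus 0.83 Dice figures above.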

Authors: Jiaxuan Wang (The University of Auckland); Ruigeng Wang (Intelligent Vision Systems Lab (IVSLab), The University of Auckland); Shahrokh Heidari (Intelligent Vision Systems Lab (IVSLab), The University of Auckland); David Arturo Valdez (Intelligent Vision Systems Lab (IVSLab), The University of Auckland); Patrice Delmas (Intelligent Vision Systems Lab (IVSLab), The University of Auckland)*


Paper 92 — Selective Multiple Reference Frame Approach for VVC Standard

Abstract: Versatile Video Coding (VVC) is the latest video coding standard and introduces advanced techniques, notably the Multiple Reference Frame (MRF) tool, aimed at enhancing inter-prediction coding efficiency. However, integrating MRF escalates the complexity of the coding process. This paper proposes a novel method to simplify MRF selection in the VVC encoder. The proposed algorithm restricts the exhaustive motion vector search through all possible reference frames to the parent coding tree unit. The selected reference frame is then utilized as the sole reference during inter-prediction for all the children coding units resulting from the partitioning process. Tested on the VVC reference software VTM-14 under a Random Access (RA) configuration, this method demonstrates a significant average reduction of 56.063% in the runtime of the motion estimation module and an overall encoder runtime reduction of 9.385%, with negligible loss in reconstructed video quality and bitrate.
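
The restriction can be sketched as follows: an exhaustive search over all candidate reference frames runs once at the parent CTU, and the child CUs inherit the winner as their sole reference. The cost function and flat-pixel-list data layout are toy stand-ins, not VTM code:

```python
def select_ctu_reference(ctu_block, ref_frames, cost):
    """Exhaustive search once per parent CTU: return the index of the
    reference frame with the lowest motion-estimation cost."""
    return min(range(len(ref_frames)),
               key=lambda i: cost(ctu_block, ref_frames[i]))

def encode_children(children, ref_frames, best_ref, cost):
    """Child CUs skip the multi-reference search and reuse the CTU's
    selected frame as their sole inter-prediction reference."""
    return [cost(cu, ref_frames[best_ref]) for cu in children]

# Toy SAD cost over flat pixel lists (a stand-in for real motion estimation).
sad = lambda blk, ref: sum(abs(a - b) for a, b in zip(blk, ref))
refs = [[0, 0, 0, 0], [10, 10, 10, 10], [5, 5, 5, 5]]
best = select_ctu_reference([6, 5, 5, 6], refs, sad)  # picks frame index 2
```

The savings come from replacing one full search per child CU with a single search per CTU, at the risk that a child would have preferred a different reference.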

Authors: Taheni Dammak (ENETCOM)*; Rana Jassim Mohammed (College of Education for Pure Sciences, University of Basrah, Basrah, Iraq); Mohamed Ali Ben Ayed (New Technologies and Telecommunications Systems (NTS’Com), ENET’Com)