Abstract
Augmented reality (AR) enhances the user’s perception of the real environment by superimposing virtual images generated by computers. These virtual images provide additional visual information that complements the real-world view. AR systems are rapidly gaining popularity in various manufacturing fields such as training, maintenance, assembly, and robot programming. In some AR applications, it is crucial for the invisible virtual environment to be precisely aligned with the physical environment to ensure that human users can accurately perceive the virtual augmentation in conjunction with their real surroundings. The process of achieving this accurate alignment is known as calibration. In some robotics applications using AR, we observed instances of misalignment in the visual representation within the designated workspace. This misalignment can potentially impact the accuracy of the robot’s operations during the task. Based on previous research on AR-assisted robot programming systems, this work investigates the sources of misalignment errors and presents a simple and efficient calibration procedure to reduce misalignment in general video see-through AR systems. To accurately superimpose virtual information onto the real environment, it is necessary to identify the sources and propagation of errors. In this work, we outline the linear transformation and projection of each point from the virtual world space to the virtual screen coordinates. An offline calibration method is introduced to determine the offset matrix from the head-mounted display (HMD) to the camera, and experiments are conducted to validate the improvement achieved through the calibration process.
1 Introduction
The augmented reality (AR) technique superimposes computer-generated graphics onto the physical environment, such as text, 3D graphics, and animations [1]. AR can be classified into three categories based on the hardware devices used [2]: hand-held device-based, head-mounted display (HMD)-based, and projector-based. The findings from Refs. [3,4] indicated that wearable devices, particularly see-through HMDs, not only provide greater accuracy but also deliver an immersive user experience that significantly enhances human perception and interaction in both physical and virtual environments. The advantages of HMD AR can be attributed to the wearable system’s ability to overlay virtual instructions directly into the operator’s field of view (FOV), which provides a self-centered perspective and an egocentric viewpoint [5,6]. Furthermore, according to Ref. [7], HMD AR demonstrates mobility and hands-free capabilities, particularly in the context of industrial applications. HMD AR has been effectively deployed across diverse manufacturing domains, encompassing areas such as training [8], maintenance [9], design [10], assembly [11], human-robot interaction [12], and remote assistance [13].
AR techniques have been effectively employed in robot programming tasks, enabling both collaborative operations at the same location and teleoperation from a distance, as illustrated in Fig. 1. A prevalent scenario involves the utilization of AR for robot programming in a shared workspace where collaborative tasks are carried out in close proximity. The integration of AR in robotics has demonstrated its potential to enhance the accuracy and efficiency of robot operations by providing supplementary information and visual aids [14]. AR merges the virtual world with the physical environment, allowing human operators to interact in the virtual realm by setting virtual waypoints for the tool center point (TCP) and defining a collision-free volume [12,15]. Subsequently, the robot perceives these inputs based on the augmented environment. However, the successful execution of robot actions relies on the accurate interaction and placement of virtual objects by humans with respect to the physical environment, as any misregistration in the AR scene can introduce significant errors caused by human factors [16]. In other words, the interaction of virtual objects is intricately interconnected with the process of perception [17]. Therefore, the effective implementation of AR in robotics necessitates precise alignment and human perception of computer-generated graphics within the physical environment.

Two types of applications of augmented reality in robot operation tasks are illustrated: (a) collaborative operations at the shared workspace and (b) teleoperation from a distance
For well-registered AR scenes, the spatial co-localization of virtual objects and the physical environment should be achieved, and these virtual objects should be interacted with as if they were in the physical environment. Thus, the virtual elements are required to be registered in the physical environment and remain stable from different points of view. The precise alignment of computer-generated objects and panels with the physical world in AR systems is referred to as registration. In other words, registration is to optimally align two or more rigid bodies by estimating the best transformation between them [18]. Calibration is the process that enables this goal and ensures precise virtual widgets and animations in AR operations. Nevertheless, misregistration can arise from disparate AR systems and a multitude of factors.
Presently, commercial AR HMD systems available on the market comprise optical see-through (OST) HMDs and video see-through (VST) HMDs [19]. While an OST HMD allows users to directly see the physical world through a transparent screen and the virtual content displayed on the screen simultaneously, the VST HMD displays the whole scenario with the physical world captured by cameras and integrated with virtual content [20]. The primary market-available AR HMDs generally fall into these two categories, such as Google Glass (OST), HoloLens (OST), Magic Leap (OST), Meta Quest Pro (VST), Varjo XR-3 (VST), and the recently released Apple Vision Pro (VST). OST HMDs usually provide better depth perception of the real world, but they suffer from limited FOVs and challenges in occlusion handling [21]. On the other hand, VST HMDs have outstanding rendering features to handle the occlusion and the consistency between the real and synthetic views [21]. Both OST and VST HMDs face challenges related to misregistration according to their different visualization methods. One essential factor contributing to misregistration is depth perception, which has been well-recognized based on the established optical models of different HMDs. There are many factors other than depth that may also affect the alignment of human perception and the virtual content, including camera pose, head pose, eyeball pose, eye focal length, and image plane position. Nevertheless, it is unclear how these factors affect the correct registration of virtual content in the physical world.
In our previous work [22,23], registration errors were observed in the VST AR interface where 3D virtual objects appeared to float differently based on the observer’s viewpoint, as shown in Fig. 2. This issue caused significant errors in robot programming as user-defined virtual waypoints shifted, leading to confusion during task completion. An error propagation model was developed in our previous work [24], categorizing the error sources into seven categories, including camera extrinsic calibration, virtual object alignment, display, tracking, camera intrinsic calibration, rendering, and tracker alignment errors. However, a research question is still open: what are the specific impacts of these error sources on misregistration in a VST AR system? To answer this research question, we analyzed the error sources in VST AR systems in detail, established a mathematical model of errors, proposed a calibration method to reduce the errors, and conducted both qualitative and quantitative evaluations of the error model and the calibration method. The detailed contributions are listed as follows:
A detailed analysis of the error sources in the VST AR systems with insights into the factors affecting misregistration.
A mathematical model that effectively captures and explains the impact of errors on the overall system performance, specifically focusing on the HMD-to-camera transformation.
A global calibration method to rectify the identified errors and improve the alignment accuracy between virtual objects and the physical environment.
Qualitative and quantitative evaluations to validate the effectiveness of the mathematical model and the calibration method, ensuring the reliability of the calibration process and its impact on misregistration reduction.

(a) A collaborative augmented reality robot work cell featuring a virtual flip pyramid as a visual representation of the robot's tool center point (TCP), and (b)–(e) demonstrate the observed misregistration between the virtual widget and the physical calibration tool from varying viewpoints
The structure of the paper is as follows: In Sec. 2, the relevant literature on addressing the registration problem in general AR systems is reviewed. Section 3 provides an overview of the applied AR system and identifies the sources of error in this model based on a pilot study. In Sec. 4, a mathematical model is proposed to explain how errors affect the results, and Sec. 5 presents a global calibration method for correcting registration errors. The implementation and results of the calibration are described in Sec. 6, and the conclusion and potential future directions are outlined in Sec. 7.
2 Related Works
HMDs that provide immersive user experiences, also referred to as mixed reality displays, can be broadly classified into two categories based on their approaches, as outlined in the survey [2]. The first category is VST AR, where users indirectly perceive their physical surroundings through an augmented video feed from cameras. This AR system continuously captures real-world image frames through cameras and superimposes computer-generated graphics as virtual content onto these frames. Users view the augmented image frames on a screen or display. On the other hand, the second category is OST AR, where users have a direct view of the physical world through semi-transparent displays or optical elements, while virtual content is simultaneously overlaid onto the screen. OST AR devices typically utilize waveguides or holographic optical elements to project virtual content seamlessly into the user’s perception, creating the illusion that it is an integral part of the real world. The primary distinction between VST AR and OST AR lies in their approaches to combining and presenting virtual information to users. The OST AR system exhibits certain limitations, including a restricted augmentable field of view, device obtrusiveness, the requirement for frequent HMD recalibrations, the low luminance of the microdisplays, and potential perceptual conflicts [6].
In the past decade, a significant number of researchers have devoted their efforts to tackling the perceptual issues inherent in wearable AR devices. Depth perception is a crucial issue affecting interaction [26]. Diaz et al. [27] reviewed design factors that influence depth perceptions in the context of see-through AR HMDs. The utilization of depth cues [28], which enables accurate monocular or binocular depth estimation, allows objects to appear in their intended spatial positions in augmented reality applications. Currently, AR HMDs offer the capability to harness binocular cues by rendering two distinct images, simulating the left and right eye perspectives of the virtual object within the real-world context [27]. This approach enables the creation of a visual experience that mimics the natural binocular vision of human observers. Therefore, extensive research is dedicated to addressing the issue of cue loss in AR content, including critical depth cues such as binocular disparity, binocular convergence, accommodative focus, relative size, and occlusions [29].
In the realm of stereoscopic VST AR HMDs, the user’s perception of the 3D world is reliant upon the interplay between two distinct optical systems: the acquisition camera and visualization display [30]. To achieve ortho-stereoscopic functionality in binocular VST HMDs, researchers have put forth several crucial conditions [31,32]. These include aligning the center of projection for both the cameras and displays with the centers of projection of the user’s eyes, precise alignment of the left and right optical axes of the displays with the corresponding optical axes of the cameras, equating the distances between the left and right cameras as well as between the left and right displays to the observer’s inter-pupillary distance, and ensuring that the FOV of the displays aligns with the FOV of the cameras. However, commercially available HMDs do not presently offer a means to achieve a genuine orthoscopic view devoid of geometric aberrations [17].
One limitation arises from the necessity for both the cameras and lenses to physically converge at the focal point, resulting in a toed-in HMD configuration [17]. However, commonly available VST HMDs adopt a parallel setup with fixed cameras and lenses, which introduces geometric aberrations, including diplopia (double vision). Nevertheless, techniques have been developed to mitigate the impact of these geometric distortions [32,33]. Therefore, numerous studies have employed custom-made VST AR devices within the research domain to achieve a stereoscopic and egocentric solution. For instance, the ZED Mini from Stereolabs [34], recognized as the world’s first external camera designed specifically for AR applications [35], has found widespread use in various research areas [36–39]. However, most studies directly attach the camera to the VR HMD using a manufacturer mount without calibration, which introduces an additional camera extrinsic error for different HMDs. Thus, a calibration or correction process becomes necessary for such a VST AR system.
In AR devices, calibration is required to mitigate geometrical distortions resulting from the HMDs. This process involves estimating the camera’s parameters and aligning various coordinates to ensure accurate correspondence between virtual and physical objects [40]. Grubert et al. [41] provided a comprehensive summary of calibration procedures in OST AR systems according to manual, semi-automatic, and automatic approaches. Manual calibration methods often involve aligning a target object or 2D marker, which introduces the possibility of input errors during subjective operations. Within the manual category, alignment setups can be categorized as either environment-centric or user-centric. In the environment-centric setup, targets are positioned at pre-determined locations [42]. Users are required to change the line-of-sight or on-screen reticle to align the target. Conversely, in the user-centric alignment, the user maintains a static location and line-of-sight while the target is movable. However, the calibration process is susceptible to a significant number of errors due to human factors, necessitating evaluation through user studies [43].
VST AR systems encounter a similar challenge. Interacting in AR environments through VST HMD devices poses an essential issue due to the discrepancy between the human eye and the camera’s intrinsic parameters (e.g., resolution limitations and optical distortions), thereby impeding accurate estimation of egocentric distances [29,33]. Given the complexity of this problem, numerous researchers strive to address it by attempting to align the object in the image to achieve approximate registration. The 3D registration problem is commonly known as the simultaneous pose and correspondence (SPC) problem. Many works have been proposed to solve the SPC problem using Expectation–Maximization (EM) algorithms, which apply an alternating minimization procedure. In this domain, some researchers focus on the iterative closest point algorithm [44–46], the Softassign algorithm [47], and variants [48]. However, EM-type algorithms show shortcomings such as local minima and a reliance on good initialization. Li and Hartley [18] proposed a global search solution for 3D-3D registration problems. In fact, the broad applicability of these algorithms in AR robot applications is limited, mainly due to their reliance on environmental information. For example, these algorithms typically necessitate a known object with 3D model information acting as a fiducial marker, which demands precise data collection and pre-training. Moreover, the calibration methods discussed in this context rely on the presence of specific physical objects for real-time calibration. However, a notable challenge arises when the target object is no longer within the field of view, as it can lead to substantial errors in the calibration process.
Fuhrmann et al. [49] introduced a rapid calibration method suitable for optical see-through and video-based AR HMDs with a straightforward implementation. Subsequently, researchers have explored automated vision-based verification and alignment methods, often employing fiducial markers during the calibration process [50]. Hu et al. [51] proposed a unique approach by calibrating misregistration using the bare hand as the alignment target. Nevertheless, various calibration methods have relied on user alignment of a target object or 2D marker, which introduces potential input errors due to the object’s six degrees-of-freedom. As a result, the calibration process needs to account for human perception. Therefore, it is imperative to identify the factors that contribute to misregistration and thoroughly investigate their effects and consequences.
3 System Implementation and Registration Framework
In our previous research works [23,24], we identified registration errors occurring within the AR scene during the operation of a human-robot AR interface. The observed phenomenon involved the floating of 3D virtual objects based on different viewpoints, as depicted in Fig. 2. Essentially, the expected stationary nature of the virtual content was compromised by shifting movements corresponding to the observer’s motion. This issue can significantly impact robot programming, as the user-defined virtual waypoints undergo shifts that may cause confusion in the user’s perception and hinder task completion.
We implemented a VST AR system comprising a stereo camera, the ZED mini, and a virtual reality headset, the Meta Quest 1st generation. The system framework is illustrated in Fig. 3. The Meta Quest 1st generation headset offers a diagonal FOV of 115 deg and a resolution of 2880 × 1600 pixels (1440 × 1600 per eye). The ZED mini, mounted on the front of the headset with manufacturer accessories, captures real image frames from the physical environment for AR video passthrough. Video is streamed to the HMD from the ZED mini, which provides a 104 deg horizontal FOV and a resolution of 2560 × 720 pixels. The Oculus Quest headset utilizes the inside-out tracking system, Oculus Insight, to track its motion. The main system operates on the Unity 3D platform of a PC, serving as the graphics engine. A virtual camera is synchronized with the real camera’s movement to capture virtual image frames, generated by the computer graphics system based on the 3D virtual environment. The real and virtual images are merged with occlusion properties using depth information in the Unity rendering pipeline, and the resulting AR images are visualized on the HMD. In this system, the stereo camera functions as two separate monocular systems, enabling the realization of binocular disparity and providing an immersive user experience. The system was tested on a laptop with the following specifications: Intel(R) Core(TM) i7-9850H @ 2.60 GHz processor, 16 GB of RAM, NVIDIA Quadro RTX 4000 GPU, and Windows 10 Enterprise operating system.
To investigate the registration error further, we implemented a small virtual ball at the tip of a calibration tool, as depicted in Fig. 4(a). The ball serves as a registration marker, aligning with the tip of the calibration tool in a hypothetical scenario where they remain stationary regardless of the observer’s viewpoint. The observer remains still at a distance of 0.5 m, except for rotating their head to allow the registration marker to appear in different zones of the AR image. Subsequently, the virtual ball exhibits movement and fails to align with the calibration tool, as observed in Figs. 4(b)–4(d). Based on our analysis of error propagation [24], we attribute the primary source of the misregistration error to the camera’s extrinsic error. From the observation of misregistration, we hypothesize that the camera’s extrinsic error undermines the final AR synthesis.

A misregistration case in a stereo video see-through AR system: (a) initially, a stationary red virtual ball is placed on the tip of a physical calibration tool, and (b)–(d) keep the viewer rotating in place, which leads to the ball and calibration tool moving to different positions on the image. The ball moves in different directions and does not align with the calibration tool.
To gain a comprehensive understanding of and mitigate the registration errors, a registration framework for video see-through augmented reality is proposed. This framework encompasses various modules, as illustrated in Fig. 5. In a typical VST AR system, all devices need to be registered within a common reference coordinate, denoted as the world coordinate in Fig. 5(a). While the camera intrinsic parameters play a role, the primary source of extrinsic errors lies in the relative transformation (V in Fig. 5(a)), which affects the generation of accurate virtual content images. This relative transformation is influenced by the registration of each element in the world coordinate.

Registration frameworks in VST AR and AR-based robot applications: (a) in a typical VST AR registration framework, all devices are aligned and registered in a world coordinate, serving as a unified reference coordinate and (b) for AR-based robot applications, a general and efficient registration framework is proposed. In this framework, all devices are calibrated and synchronized to the HMD tracking coordinates, establishing a common reference frame.
For general applications, a concise and effective approach is to employ the HMD tracking coordinate as the reference frame, as depicted in Fig. 5(b). Since the camera is fixed on the HMD, the camera-to-tracking transformation can be shared with the HMD tracking system, represented as the H transformation, accompanied by a fixed offset transformation denoted as D. While the virtual objects (such as waypoints) are defined by the transformation F based on the user’s perspective, the accuracy of the relative transformation A may be compromised, resulting in misregistration and an inadequate representation of virtual objects.
This visualization holds significant importance in AR-based robot programming applications within the context of digital twin and intelligent manufacturing. In such applications, the robot’s programmed motion is determined by the C transformation between the virtual robot base and virtual waypoints, which are represented in the robot space. It is worth noting that the accuracy of this transformation relies on the F and G transformations, with G being dependent on the I transformation and F being influenced by the camera’s extrinsic calibration. Consequently, the precision of robot programming is directly impacted by the accuracy of the visualization.
Hence, achieving dynamic registration of camera-to-tracking is crucial, and this transformation can be propagated through two separate transformations, D and H. Since the H transformation can be obtained from the HMD tracking system, it is imperative to conduct a pre-calibration to obtain the HMD-to-camera transformation, which serves as a fixed offset when the camera is mounted on the HMD. Although this problem could be solved by the manufacturer integrating the camera and HMD, it remains an issue for applications in which the camera cannot be integrated into the HMD. For example, as shown in Fig. 1(b), in the AR-based teleoperation system, the camera is near the robot, which is remote to the user who wears an HMD, and the camera extrinsic error still needs to be overcome. In the scenario of a remote AR system, the camera’s registration with another tracking system is accomplished through the L transformation. By calibrating the two tracking systems with the J transformation, the camera’s absolute extrinsic parameters can be determined. The key transformations and corresponding calibration processes are summarized in Table 1.
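Concretely, this propagation is a composition of homogeneous transformations. The minimal sketch below (NumPy, 4 × 4 matrices; the function names and transform-direction conventions are our assumptions rather than the system's API) shows how the camera extrinsics E and the remote-camera registration K listed in Table 1 are obtained from their parent transformations. Once D has been calibrated offline, E can be updated every frame from the live H reported by the tracking system.

```python
import numpy as np

def camera_to_world(H_hmd2world: np.ndarray, D_cam2hmd: np.ndarray) -> np.ndarray:
    """E: propagate the camera extrinsics from the tracked HMD pose (H)
    and the fixed camera-to-HMD offset (D)."""
    return H_hmd2world @ D_cam2hmd

def remote_camera_to_world(J_other2world: np.ndarray, L_cam2other: np.ndarray) -> np.ndarray:
    """K: chain the cross-tracking registration (J) with the camera pose
    expressed in the remote tracking system (L)."""
    return J_other2world @ L_cam2other
```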
Key transformations in the VST AR registration frameworks
Transformation | Description | Nature | Calculation process
---|---|---|---
Camera intrinsics | Camera optical properties | Fixed | Camera calibration
A, B, V | Camera extrinsics | Varying | Register to a common reference
C | Objects (e.g., waypoints) in the robot coordinate | Varying | Register to a common reference
D | Camera-to-HMD | Fixed | Manufacturer setting or calibrated
E | Camera-to-world | Varying | Depends on D and H
F | Objects-to-world | Varying | Defined by users
G | Virtual-robot-to-world | Fixed | Depends on I
H | HMD-to-world | Varying | From tracking system
I | Robot-to-world | Fixed | Manual measurement
J | Tracking-to-tracking | Varying | Needs registration
K | Other-cameras-to-world | Varying | Depends on J and L
L | Camera-to-other-tracking | Varying | If the camera is movable, it should be tracked; otherwise, the static transformation should be calibrated
In essence, the registration problem boils down to accurately propagating transformations. To simplify the aforementioned registration framework, a key aspect of achieving precise registration lies in the camera-to-tracking transformation, which is the synchronization of virtual cameras with their real counterparts. Figure 6 illustrates the transformation model of the virtual cameras. The base origin represents the virtual world origin, and the virtual objects are positioned in this world coordinate system through the world-to-object transformation. The HMD tracking center is also tracked in the virtual world coordinates, enabling the acquisition of the world-to-HMD transformation via the tracking system. This transformation provides the relative pose of the HMD, including its position and orientation, in the virtual world system. The HMD-to-camera transformation remains fixed, as the real camera is rigidly mounted on the HMD. Significantly, the pose of the virtual camera with respect to the virtual world coordinates is determined by two transformations: world-to-HMD and HMD-to-camera. Each virtual camera then generates a perspective projection of the objects onto an image plane to produce a 2D virtual image. The camera-to-image transformation, which encompasses intrinsic parameters such as the optical center, focal length, field of view, and lens distortions, represents the nonlinear mapping involved in this process. Drawing upon the preceding discussion, we have classified the sources of misregistration between virtual and real images into four distinct categories:
Inaccurate placement of virtual objects in the virtual world: In many AR-based robot programming applications, the entire robot work cell serves as the shared space. Without physical objects or features as registration targets, virtual objects rely on accurate world-to-object transformations. If these transformations are imprecise, the corresponding objects will appear misaligned in the image.
Inaccurate world-to-HMD transformation: This error has two components. First, tracking errors in the tracking system used to determine the HMD’s pose introduce imprecision. These errors depend on the tracking system and distance. Second, misalignment between the origins of the tracking system and the virtual world system further contributes to inaccuracies.
Lack of synchronization between the virtual camera and the real camera’s movement: This error arises from two factors. First, the previous imprecise world-to-HMD transformation leads to an inaccurate virtual camera position in the virtual world system. Second, the relative offset transformation from the HMD to the camera introduces inaccuracy.
Errors in the camera-to-image mapping: The perspective projection of real cameras is typically described using a pinhole camera model with distortion. Inaccurate intrinsic parameters representing this camera model result in the misregistration of pixel positions in the 2D image.

The registration framework simplifies the process of mapping virtual objects in video-based augmented reality systems onto images using a virtual camera
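The four categories above correspond to successive stages of the mapping in Fig. 6. As a minimal illustration, the sketch below chains these transformations for a single virtual point under a standard pinhole model without distortion; the variable names and conventions are illustrative and are not taken from the actual implementation.

```python
import numpy as np

def project_virtual_point(p_obj, T_obj2world, T_hmd2world, T_cam2hmd, K):
    """Map a virtual-object point to pixel coordinates through the chain
    object -> world -> HMD -> camera -> image. Inaccuracy in any stage
    shifts the rendered pixel relative to the real camera image."""
    p_world = T_obj2world @ np.append(p_obj, 1.0)   # category 1: object placement
    T_cam2world = T_hmd2world @ T_cam2hmd           # categories 2-3: HMD tracking and HMD-to-camera offset
    p_cam = np.linalg.inv(T_cam2world) @ p_world
    uvw = K @ p_cam[:3]                             # category 4: intrinsics (distortion omitted)
    return uvw[:2] / uvw[2]
```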
4 Mathematical Model
4.1 Translation Errors Along the Z-X-Y Axes.

(a) Misregistration error caused by Z-axis translation Kz = [0.1, 0.1, …, 0.1] (left) and Kz = [−0.1, −0.1, …, −0.1] (right). Upper row: virtual content pixel movement direction; lower row: simulated magnitude of movement. (b) Error resulting from X-axis translation with parameters Kx = [0.01, 0.01, …, 0.01] (left) and Kx = [−0.01, −0.01, …, −0.01] (right); uniform movement magnitude for all pixels. (c) Error resulting from Y-axis translation with parameters Ky = [0.01, 0.01, …, 0.01] (left) and Ky = [−0.01, −0.01, …, −0.01] (right); uniform movement magnitude for all pixels.
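The displacement fields shown above can be reproduced qualitatively with a short simulation. The sketch below uses an assumed focal length, principal point, and content depth (not the values used in this work); it perturbs the translation of the virtual camera relative to the real one and reports how each rendered pixel moves, giving a radial, zoom-like field for a Z-axis error and a uniform shift for X- and Y-axis errors.

```python
import numpy as np

# Assumed pinhole parameters and a fronto-parallel sheet of virtual points at depth Z.
fx = fy = 700.0
cx, cy = 640.0, 360.0
Z = 1.0                                   # depth of the virtual content (m)

u, v = np.meshgrid(np.arange(0, 1280, 64), np.arange(0, 720, 64))
X = (u - cx) / fx * Z                     # back-project pixels into the real camera frame
Y = (v - cy) / fy * Z

def reproject(X, Y, Z):
    return fx * X / Z + cx, fy * Y / Z + cy

def translation_error(dx=0.0, dy=0.0, dz=0.0):
    """Per-pixel displacement of the virtual content when the virtual camera is
    translated by (dx, dy, dz) relative to the real camera."""
    u2, v2 = reproject(X - dx, Y - dy, Z - dz)
    return u2 - u, v2 - v

du_z, dv_z = translation_error(dz=0.1)    # Z error: radial, zoom-like displacement
du_x, dv_x = translation_error(dx=0.01)   # X error: uniform shift of all pixels
```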
4.2 Rotation Errors Along the Z-X-Y Axes.

The distribution of misregistration directions and the corresponding heat map illustrating the magnitude of misregistration caused by individual factors are presented: (a) misregistration error resulting from rotation along the Z-axis with angles of γ = π/12 (left) and γ = −π/12 (right), (b) misregistration error arising from rotation along the X-axis with angles of α = π/12 (left) and α = −π/12 (right), and (c) misregistration error due to rotation along the Y-axis with angles of β = π/12 (left) and β = −π/12 (right)
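Rotation errors can be simulated in the same way by rotating the back-projected points before re-projection. The sketch below (again with assumed intrinsics and sample points, for illustration only) shows that a Z-axis error of γ rotates the virtual content about the image center, whereas X- and Y-axis errors shift the content across the image.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

K = np.array([[700.0, 0.0, 640.0],
              [0.0, 700.0, 360.0],
              [0.0, 0.0, 1.0]])            # assumed intrinsic matrix

def to_pixels(pts_cam):
    uvw = (K @ pts_cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]

def rotation_error(axis: str, angle_rad: float, pts_cam: np.ndarray):
    """Per-point pixel displacement when the virtual camera is rotated by
    angle_rad about one of its axes relative to the real camera."""
    Rerr = R.from_euler(axis, angle_rad).as_matrix()
    # Rotating the camera by Rerr moves scene points by Rerr^T in the camera frame.
    return to_pixels(pts_cam @ Rerr) - to_pixels(pts_cam)

# Example: gamma = pi/12 about the Z-axis rotates the content about the image center.
pts = np.array([[0.2, 0.1, 1.0], [-0.3, 0.05, 1.0], [0.0, -0.2, 1.0]])
print(rotation_error('z', np.pi / 12, pts))
```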
5 Calibration of Head-Mounted Display-to-Camera Registration
The primary objective of this work is to achieve global misregistration reduction in video-based augmented reality systems without the use of detectable patterns. Accurate visualization of virtual content relies on the relative transformation between the camera and virtual objects, which are both registered in the same reference coordinate system, as explained in Sec. 3. In essence, the position of virtual content within the reference coordinate system is defined by the user based on the visualization outcome, forming an interconnected process. Figure 5 illustrates the registration framework, while Fig. 6 presents a simplified transformation. It is crucial to ensure precise transformations at each step to achieve the desired level of accuracy in the final registration process.
Among the four error categories mentioned earlier, the errors associated with world-to-object and camera-to-image can be effectively eliminated. This is because the virtual objects maintain static positions in the virtual world, which is defined through user perception, and the camera is calibrated with accurate intrinsic parameters through Zhang’s calibration method [54]. However, errors arise in the world-to-HMD transformation due to the employed tracking techniques, which are not the primary focus of this study. For our investigation, we utilize two commercial HMD devices, namely the Oculus Rift S and the first-generation Oculus Quest, both of which exhibit negligible inaccuracies that fall within acceptable tolerance levels. The positional accuracy of the Oculus Rift S HMD, as reported in a previous study, averages 1.66 mm [55], while the first-generation Oculus Quest demonstrates an average positional accuracy of 6.86 mm [56]. Consequently, based on the derived error impacts from the mathematical model, our main objective is to calibrate the HMD-to-camera transformation, aiming to minimize global misregistration while considering the existing errors associated with the world-to-HMD transformation.
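The intrinsic calibration cited above follows Zhang's method [54] and can be reproduced with standard tooling; the sketch below uses OpenCV's implementation, with the checkerboard geometry and image paths as placeholders.

```python
import glob

import cv2
import numpy as np

# Placeholder checkerboard geometry: 9 x 6 inner corners, 25 mm squares.
pattern = (9, 6)
template = np.zeros((pattern[0] * pattern[1], 3), np.float32)
template[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 0.025

obj_points, img_points = [], []
for path in glob.glob("calib_images/*.png"):          # placeholder path
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        corners = cv2.cornerSubPix(
            gray, corners, (11, 11), (-1, -1),
            (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3))
        obj_points.append(template)
        img_points.append(corners)

# Zhang's method: recover the intrinsic matrix K and the distortion coefficients.
rms, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    obj_points, img_points, gray.shape[::-1], None, None)
```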
5.1 Initial Estimation of Head-Mounted Display-to-Camera.

Transformations involved in the calibration process, where the camera is attached to the HMD to capture images of a stationary checkerboard from various viewpoints
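The configuration above, a camera rigidly attached to a tracked HMD viewing a stationary checkerboard, matches the classical hand-eye geometry, so one plausible way to obtain an initial estimate of the HMD-to-camera offset is a hand-eye solver. The sketch below uses OpenCV's calibrateHandEye under that assumption; it is not necessarily the exact procedure used in this work, and the variable names are illustrative.

```python
import cv2
import numpy as np

def estimate_hmd_to_camera(T_hmd2world_list, T_board2cam_list):
    """Initial estimate of the fixed camera-to-HMD offset via hand-eye calibration
    (AX = XB), treating the HMD as the 'gripper' and the checkerboard as the target.

    T_hmd2world_list: HMD poses from the tracking system (4x4, one per viewpoint).
    T_board2cam_list: checkerboard poses from solvePnP on the camera images (4x4).
    """
    R_h2w = [T[:3, :3] for T in T_hmd2world_list]
    t_h2w = [T[:3, 3] for T in T_hmd2world_list]
    R_b2c = [T[:3, :3] for T in T_board2cam_list]
    t_b2c = [T[:3, 3] for T in T_board2cam_list]

    R_cam2hmd, t_cam2hmd = cv2.calibrateHandEye(
        R_h2w, t_h2w, R_b2c, t_b2c, method=cv2.CALIB_HAND_EYE_TSAI)

    D = np.eye(4)                          # camera-to-HMD, i.e., the offset D in Fig. 5(b)
    D[:3, :3], D[:3, 3] = R_cam2hmd, t_cam2hmd.ravel()
    return D
```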
However, it is crucial to acknowledge that this calibration step cannot entirely eliminate all system errors. One key assumption made during this calibration process is the absence of tracking errors in the HMD, but it can be challenging to completely eliminate such errors due to variations in tracking systems and distances involved. Furthermore, in practical applications, virtual objects lack physical reference points. The determination of the world-to-object transformation relies on subjective judgments made by users based on visual results. Inaccuracies in the world-to-object transformations can adversely affect the robot-to-object matrix, consequently impacting the execution of robot programming, as detailed in Fig. 5. Therefore, additional efforts are required to mitigate the propagation of global transformations and achieve more precise world-to-object transformations.
5.2 Error Correction of Head-Mounted Display-to-Camera Registration.
An additional error correction is performed on the HMD-to-camera transformation matrix. The six parameters, encompassing three translations and three rotations, can be adjusted to refine the initial estimate, which is affected by uncorrected errors propagated from the world-to-HMD transformation.
Based on the simulation results presented in Sec. 4, it is observed that certain individual adjustments may yield similar corrections. For instance, translating along the X-axis and rotating along the Y-axis can produce comparable effects. Consequently, the selection of adjusted parameters depends on their uncertainty and sensitivity. Bajura and Neumann [58] discussed the uncertainty and sensitivity of image-space errors, highlighting that the camera’s position has a greater impact on the projection of an image when the viewed point is relatively close to the camera, whereas the camera’s orientation plays a more significant role when the viewed point is further away. This analysis aligns with the findings reported in Sec. 4.
To accommodate a zoom-in or zoom-out effect on the virtual content, it becomes necessary to adjust the translation along the Z-axis. Increasing the Z-axis translation will result in a zoom-out effect while decreasing it will yield a zoom-in effect.
In the case of rotation of the virtual content along the normal passing through the image center, adjustment of the rotation along the Z-axis becomes essential to ensure proper alignment of the axis orientation.
When there is a translation of the virtual content and the observed virtual points are relatively close to the camera, it is recommended to modify the translation along both the X and Y axes of the camera.
In scenarios where the virtual content undergoes translation and the observed virtual points are at a considerable distance from the camera, it is advisable to modify the rotation along both the X and Y axes of the camera.
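Numerically, the correction amounts to composing the initial estimate with a small adjustment matrix built from the six parameters. The sketch below is one way to express this; the Z-X-Y Euler convention and the side on which the correction is composed are assumptions rather than prescriptions from the system.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def apply_correction(D_initial: np.ndarray, dx, dy, dz, alpha, beta, gamma) -> np.ndarray:
    """Compose the initially estimated HMD-to-camera offset with a small correction
    defined by three translations (m) and three rotations (rad)."""
    correction = np.eye(4)
    correction[:3, :3] = R.from_euler('ZXY', [gamma, alpha, beta]).as_matrix()
    correction[:3, 3] = [dx, dy, dz]
    return D_initial @ correction

# Example: counter an observed zoom effect by nudging the Z translation,
# with the sign chosen according to the guidance above.
# D_corrected = apply_correction(D_initial, 0.0, 0.0, -0.005, 0.0, 0.0, 0.0)
```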
6 Experimental Evaluation and Results
This section presents a comprehensive evaluation of the mathematical model, considering both qualitative and quantitative aspects. The qualitative evaluation involves the introduction of pre-set individual errors to the calibrated HMD-to-camera transformation. By examining the resulting display output, we can verify the consistency between the mathematical model and the simulation results. The primary objective of this qualitative evaluation is to confirm the trends and distributions observed in the simulations. In the quantitative evaluation, we aim to assess the effectiveness of the calibration approach by comparing it with the utilization of the manufacturer’s default settings. This quantitative assessment serves as a measure of the calibration’s performance and its potential benefits over using default configurations.
6.1 Qualitative Evaluation.
In the qualitative experiment, a virtual point grid is registered onto a physical checkerboard, as depicted in Fig. 11. The target consists of a 9 × 7 pattern of circles with a pitch of 20 mm, where each circle has a diameter of 5 mm. The objective is to align the virtual circles with the intersections of the grid on the physical checkerboard and examine the alignment from various viewpoints. Following the calibration process, the virtual points accurately align with their respective positions on the checkerboard, as observed during the viewpoint changes depicted in Figs. 12(a)–12(c).

The registration target in the AR scene, where a virtual circle grid is expected to align with the intersections on the physical checkerboard

The first row (a)–(c) shows the registration after calibration from different points of view. Translation errors on HMD-to-camera are introduced in the second row (d)–(f): zoom-in and zoom-out effects are caused by translation along the Z-axis in (d) and (e), and the offset in (f) is due to a combination of translation along the X-axis and Y-axis. Rotation errors are appended in the last row (g)–(i): (g) shows the consequence of an additional rotation error along the Z-axis, while rotations along the X-axis and Y-axis result in offset misregistration in (h) and (i), respectively.
Subsequently, intentional errors are introduced into the HMD-to-camera transformation. When solely applying a translation along the Z-axis in the camera coordinates, the virtual grid undergoes a positive translation, resulting in a zoomed-in appearance as depicted in Fig. 12(d). Conversely, a negative translation causes the grid to be zoomed out, as shown in Fig. 12(e), indicating that the virtual camera is positioned behind the real camera. Additionally, an offset in the virtual grid is observed when adjusting the translation along the X and Y axes, as shown in Fig. 12(f). Furthermore, individual rotation errors are evaluated. When a rotation error occurs along the Z-axis, the virtual content rotates around the image center, as illustrated in Fig. 12(g). Similarly, rotations along the X and Y axes lead to a displacement of the virtual grid, as shown in Figs. 12(h) and 12(i).
The qualitative results obtained from the conducted experiments validate the trends and distribution observed in the simulations presented in Sec. 4. By employing a plane registration target, it becomes evident that the individual factors within the HMD-to-camera transformation contribute to distinct misregistration errors. Furthermore, the observed trends and distribution of these misregistrations exhibit clear distinctions between the separate factors. This not only confirms the effectiveness of the mathematical model but also establishes a solid groundwork for future research aimed at analyzing more intricate error models.
6.2 Quantitative Evaluation.
In this section, we employ a quantitative method to measure the improvement in calibration compared to using the default settings (before calibration). The accuracy of visualization is assessed through frames defined in the AR system, with the ground truth of the frame pose measured through a calibration tool. To define a frame in the AR system, we introduce and implement a modified three-point method, commonly used for defining user frames in robotics.
The traditional three-point method in robotics involves using three reference points: the origin, the X-direction, and the Y-direction, to redefine a coordinate system as a user frame. However, these three points contribute differently to the accuracy of the frame definition. To address this, we apply a modified version of the method to mitigate the influence of individual points.
In the modified three-point method, three points need to be defined in 3D space: one along the X-direction, one along the Y-direction, and one along the Z-direction, all sharing the same fixed offset from the origin. It is known that a minimum of three non-collinear points can define the pose of a rigid body in three-dimensional space. These reference points, referred to as landmarks, are equidistant from the origin but located on different axes. Users are guided to define these landmarks using a 3D-printed calibration tool (shown in Fig. 14). The task is simple: users align virtual balls with the calibration tool based on their visual perception. By determining the position of each landmark, the pose of the frame can be calculated. If the initial registration of the frame is accurate, regardless of camera viewpoint changes, the virtual balls should consistently align with the physical balls. To facilitate repeated experiments, we consider two conditions for camera movement, as illustrated in Fig. 13:

The measurement data are collected under two conditions. In condition 1, the camera stays in the same position and captures images by rotating the view direction. In condition 2, the camera changes position and captures images from different points of view.

The experimental setup of the quantitative evaluation. The HMD with the camera is mounted on the manipulator, which is programmed to change pre-set points of view. Three virtual balls are measured to align with the physical calibration tool.
The camera remains fixed at a specific position while rotating to capture the target in different regions of the image, including the top, bottom, left, and right.
The camera changes its position while ensuring that the target remains centered in the image.
To facilitate comparison, we gather two sets of AR images: one taken before calibration and the other after calibration. In each experimental condition, four images are captured using the left-eye camera. To ensure consistent observation positions, the HMD with the camera is securely mounted on a manipulator, specifically the Aubo i5, as depicted in Fig. 14. The robot manipulator possesses a repeatability of ±0.05 mm and is programmed to maintain consistent observation positions throughout the experiment.
The physical calibration tool used in this study consists of 3D-printed axes with three spheres positioned at the top of each axis (refer to Fig. 14). Each sphere has a diameter of 10 mm. The calibration tool is positioned at a distance of 1 m from the camera. The alignment process involves aligning three virtual balls of the same size with the physical spheres. Initially, three sets of balls are registered from a specific starting point of view using the aforementioned modified three-point method. Subsequently, eight images are captured from two additional points of view.
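One way to recover the frame pose from the three aligned landmarks is sketched below. It assumes the landmarks sit at the same offset d along the frame's X, Y, and Z axes (the geometry of the 3D-printed tool) and that the frame is right-handed; the function is illustrative rather than the exact implementation used in the experiments.

```python
import numpy as np

def frame_from_landmarks(p_x, p_y, p_z, d):
    """Recover a 4x4 frame pose from three landmarks placed at offset d along
    the frame's X, Y, and Z axes (modified three-point method)."""
    p_x, p_y, p_z = (np.asarray(p, float) for p in (p_x, p_y, p_z))
    centroid = (p_x + p_y + p_z) / 3.0

    # The landmark plane's normal points along (x + y + z); the origin sits
    # d / sqrt(3) behind the centroid along that normal.
    normal = np.cross(p_y - p_x, p_z - p_x)
    normal /= np.linalg.norm(normal)
    origin = centroid - (d / np.sqrt(3.0)) * normal

    axes = np.column_stack([p_x - origin, p_y - origin, p_z - origin]) / d
    u, _, vt = np.linalg.svd(axes)          # re-orthonormalize to absorb alignment noise
    rotation = u @ vt

    T = np.eye(4)
    T[:3, :3], T[:3, 3] = rotation, origin
    return T
```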
The registration error is quantified by calculating the pixel offset of the ball center. A comparative analysis is performed using a two-tailed t-test to assess the misregistration in 12 sets under each condition. The results are presented graphically in Fig. 15, demonstrating a significant reduction in misregistration after the calibration process.
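The statistical comparison itself is a standard independent two-sample test on the per-image pixel offsets; a brief sketch with SciPy follows (the measurement arrays are supplied by the experiment and are not reproduced here).

```python
import numpy as np
from scipy import stats

def pixel_offsets(detected_centers, expected_centers):
    """Euclidean pixel offsets between the rendered virtual ball centers and the
    corresponding physical ball centers detected in the images."""
    diff = np.asarray(detected_centers, float) - np.asarray(expected_centers, float)
    return np.linalg.norm(diff, axis=1)

def compare_conditions(offsets_before, offsets_after):
    """Two-tailed independent t-test on the misregistration sets collected
    before and after calibration."""
    t_stat, p_value = stats.ttest_ind(offsets_before, offsets_after)  # two-tailed by default
    return t_stat, p_value
```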

Significant differences were observed between the calibration results under two conditions, with p-values of 0.002852 and 0.000015 obtained from a two-tailed t-test for each condition
7 Conclusions
This study focuses on addressing the misregistration issues that occur in video see-through augmented reality systems, which limit the potential for shared workspaces between humans and machines in shop-floor applications. The sources of error are analyzed and presented within a comprehensive system framework. A mathematical model is proposed to characterize the impact and sensitivity of the error specifically in the HMD-to-camera transformation. The model examines the six individual factors comprising the transformation matrix, including three translations and three rotations, and simulates the resulting misregistration effects. To mitigate the HMD-to-camera error, a closed-loop calibration method is introduced and applied in a prototyping system. The calibration process involves the initial estimation and fine adjustment of the HMD-to-camera transformation. Both qualitative and quantitative evaluations are conducted to validate the mathematical model and the calibration approach. The results demonstrate successful global registration between virtual objects and the physical environment.
A limitation of this research is that the relationship between registration accuracy and the depth of the viewed target remains unexplored. Points in the shared workspace will not share a consistent registration error, and the distribution of misregistration has yet to be studied. Another future research question is how to achieve accurate registration in a specific application. For example, in an AR-based robot programming task, the virtual world is required to align with the real world via the robot base to share the same robot workspace. These potential problems will be prioritized in future work.
Acknowledgment
This material is based upon work supported by the National Science Foundation under Award No. DGE-2125362. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Funding Data
NSF FW-HTF 2222853.
Conflict of Interest
There are no conflicts of interest.
Data Availability Statement
The authors attest that all data for this study are included in the paper.
Nomenclature
In this paper, we use the following nomenclature. Capital letters denote vectors and boldface capital letters denote matrices. It is common to use a specific letter, such as “T” or “M”, to represent a general transformation matrix. Lowercase letters denote scalar values. Given coordinate systems A and B, the transformation from A to B is defined by P_B = T_AB P_A, where T_AB is the homogeneous transformation matrix from A to B. The unit of translation in the transformation matrix is the meter.