Predicting and Analyzing Obstacles Based on Pedestrian Behavior at Crosswalks for Drivers and Blind People

  • Subject: Predicting and Analyzing Obstacles Based on Pedestrian Behavior at Crosswalks for Drivers and Blind People.

 

  • Research objective: Develop a method for predicting object appearance from vehicle blind spots at crosswalks.

 

  • Develop a robust next-action prediction system based on the Vision Transformer architecture, ultimately improving safety in autonomous and semi-autonomous vehicles. A further goal is to empower visually impaired individuals with newfound independence and confidence in navigating their surroundings. Finally, evaluate the model’s performance in both the automotive environment and the environment of visually impaired users.

 

Step 1:

  • Analysis of trends in image processing technology for autonomous driving behavior prediction.
  • Analysis of trends in image processing technology for predicting the behavior of visually impaired people.
  • Analysis of trends in camera-based trajectory prediction systems.
  • Collect data and analyze deployment trends for hazard detection and feedback.
  • Develop hazard recognition and feedback system based on deep learning.

 

Step 2:

  • Optimize hazard identification system based on image data.
  • Develop hazard recognition and feedback system based on deep learning.
  • Test hazard detection and trajectory prediction system in pilot environment.
  • Develop and validate the performance of the auditory-based feedback system.
  • Develop a method to predict the appearance of objects from a vehicle’s blind spot at crosswalks using a Vision Transformer combined with YOLO.

 

Step 3:

  • Develop integrated hazard recognition and feedback system based on deep learning.
  • Test hazard recognition system in real-world environment.
  • Algorithm enhancement.

 

  • Research Background:

 

Road safety is pivotal in advancing Driver Assistance (DA) and Autonomous Driving (AD) systems. According to the World Health Organization (WHO), more than 1.3 million people die in traffic accidents each year [1]. Pedestrian protection is a paramount objective of autonomous vehicles. The Automatic Emergency Braking system for Pedestrians (AEB-P) is an exemplary driver assistance system for safeguarding pedestrians: it identifies pedestrians along the vehicle’s projected trajectory and activates emergency braking when the driver cannot prevent a potential collision. By avoiding collisions or reducing collision speed, the pedestrian AEB system mitigates pedestrian fatalities and injuries.

 

Figure 1. Pedestrian Path Prediction for Autonomous Driving

 

The aim is to devise a technique for forecasting the appearance of objects from vehicle blind spots at intersections. Vehicles in a driver’s blind spots cannot be observed directly, yet pedestrians positioned in front of the driver can often see them. Pedestrians commonly check visually for approaching vehicles before crossing the road, then decide and act according to what they perceive. Hence, vehicles in blind spots can be anticipated from the observable reactions of pedestrians to them.

Figure 2. Pedestrian Path Prediction for Autonomous Driving

 

 

However, objects in blind spots, such as areas outside the field of view or regions occluded by other objects, cannot be observed directly, so their appearance cannot be predicted by observation alone. This limitation makes it hard to avoid sudden accidents, for example when a vehicle abruptly emerges from a blind spot. To mitigate such unforeseen accidents, it is crucial to predict the appearance of objects in blind spots.

First, our system aims to predict the future states of vehicles and pedestrians, representing their trajectories over time.

Second, it specifies the requirements for predicting pedestrian behavior in autonomous driving and derives relevant metrics to evaluate prediction performance.

Third, it focuses on developing a method for predicting the appearance of objects from vehicle blind spots at intersections.

Our proposed approach utilizes the Vision Transformer (ViT) to analyze and predict pedestrian behavior effectively. In this context, prediction relies on acquiring information about the pedestrian’s location and actions (e.g., pedestrians stopping at intersections and peering into blind spots).

 

  • Need for research
  • The research holds significant potential for broad impact, offering transformative advancements in automotive safety and improvements in the quality of life of individuals with visual impairments. First, a next-action prediction system built on the Vision Transformer architecture could fundamentally change automotive safety: by accurately predicting the behavior of objects and road users near the vehicle, it could substantially reduce the likelihood of accidents and collisions. Autonomous and semi-autonomous vehicles with such predictive capabilities could operate with greater awareness and responsiveness, ultimately making roadways safer for all.
  • Equally significant is the potential impact on the visually impaired community. Assistive technologies powered by predictive analytics and computer vision can provide real-time information and guidance, offering newfound independence and confidence in navigating the world. This research could break down barriers for visually impaired individuals, improving their daily lives by granting them greater autonomy and accessibility. Furthermore, optimizing object recognition and detection algorithms, particularly YOLO, will extend these benefits to a wider range of scenarios and environments, from urban streets to rural landscapes and adverse weather conditions. Overall, the potential impact of this research extends beyond technological innovation; it promises a safer, more inclusive, and more equitable future.
  • Every year, over 1.3 million people lose their lives in traffic accidents, and many visually impaired individuals are also involved in accidents for various reasons. The development of such technologies could substantially improve the lives of visually impaired individuals.
  • Blind people are often accompanied by seeing-eye dogs that help them move around and perform practical tasks, yet they remain exposed to many physical hazards. An image-based hazard detection and feedback system can help them overcome these problems by ranking risk factors and providing useful feedback, including for blind people who avoid going out because of anxiety.

 

 

  • State of the art

 

Hara [5] conducted research on predicting the appearance of vehicles in blind spots; the core idea is to make judgments based on a person’s gaze in the forward direction. The ViT architecture employs self-attention mechanisms for image processing (Ding et al., 2022). It works by partitioning an image into patches and flattening those patches to generate lower-dimensional linear embeddings. Positional embeddings are then added, and the resulting sequence is fed into a standard Transformer encoder. The model is initially pretrained with fully supervised image labels on a large dataset and subsequently fine-tuned on the downstream dataset for image classification (Park et al., 2022).
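To make this concrete, here is a minimal PyTorch sketch of the patch-embedding stage described above. The dimensions follow the common ViT-Base configuration; the class name and the final encoder wiring are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into fixed-size patches, linearly embeds them,
    prepends a class token, and adds learnable positional embeddings."""

    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2
        # A strided convolution implements patchify + linear projection in one step.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, self.num_patches + 1, embed_dim))

    def forward(self, x):                    # x: (B, 3, 224, 224)
        x = self.proj(x)                     # (B, 768, 14, 14)
        x = x.flatten(2).transpose(1, 2)     # (B, 196, 768) patch embeddings
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend the class token
        return x + self.pos_embed            # add positional information

# The embedded sequence then feeds a standard Transformer encoder.
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)
tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
features = encoder(tokens)                   # (1, 197, 768)
```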

A noteworthy study by Carion et al. (2020), “End-to-End Object Detection with Transformers,” explores the potential of transformers in object detection. It demonstrates that transformer-based detectors can outperform traditional convolutional neural networks (CNNs) in tasks like object detection by directly modeling global image context. The versatility of ViTs has also been demonstrated in other domains, such as medical image analysis, where they have shown promise in detecting and diagnosing diseases from medical images. Image classification is an essential task in computer vision, wherein the objective is to assign a label to an image by analyzing its content. In recent years, deep convolutional networks such as YOLOv7 have emerged as a prevailing technique (Yung et al., 2022). The You Only Look Once (YOLO) algorithm, introduced by Redmon et al. (2016), has significantly impacted real-time object detection. In contrast to conventional two-stage detectors, YOLO is a single-stage detector that uses one pass through the neural network to directly predict bounding boxes and class probabilities (Redmon et al., 2016). This attribute leads to faster inference times, making it appropriate for real-time applications. Recent advances in transformer architectures have also demonstrated significant potential for attaining competitive results in image classification tasks.

ViT models can undergo fine-tuning on video datasets, thereby acquiring the ability to discern and predict actions in videos. Girdhar and Grauman (2021) introduced the Anticipative Video Transformer (AVT), a novel end-to-end attention-based video modeling framework. AVT aims to predict forthcoming actions within a sequence by attending to the previously observed video frames.

The AVT model comprises a frame feature encoder and an action anticipation module. The frame feature encoder takes a sequence of video frames as input and produces a sequence of feature vectors. The action anticipation module then uses these feature vectors to forecast the subsequent action in the video sequence, as sketched below.
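As an illustration of this two-stage design, the following sketch pairs per-frame features (e.g., ViT [CLS] vectors) with a causally masked Transformer head that forecasts the next action. This is our own simplification under assumed dimensions, not the authors’ released implementation.

```python
import torch
import torch.nn as nn

class ActionAnticipator(nn.Module):
    """Illustrative AVT-style head: causally masked temporal attention over
    per-frame feature vectors, predicting the action at the next step."""

    def __init__(self, feat_dim=768, num_actions=50, num_layers=4, nhead=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=nhead,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, num_actions)

    def forward(self, frame_feats):          # (B, T, feat_dim) from a frame encoder
        T = frame_feats.shape[1]
        # Causal mask: position t may only attend to frames <= t, so the
        # prediction never peeks at future frames.
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.temporal(frame_feats, mask=mask)
        return self.head(h[:, -1])           # logits for the next action

# Example: anticipate the next action from 16 observed frames.
feats = torch.randn(2, 16, 768)              # e.g., ViT [CLS] features per frame
logits = ActionAnticipator()(feats)          # (2, 50)
```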

 

 

The AVT paper constitutes a noteworthy contribution to the domain of action anticipation. The proposed model effectively captures the sequential progression of observed actions and long-range dependencies, which is essential for this project.

 

  • Previous approaches to problems, deficiencies, and solutions

 

As noted above, objects in blind spots, such as areas outside the field of vision or regions obstructed by other objects, cannot be predicted by approaches that rely on direct observation, so such approaches may not be effective at avoiding sudden accidents. Hence, the goal is to develop a method for predicting the appearance of objects in blind spots at intersections. Blind people use a white cane for auditory and tactile feedback; however, this device requires training and habituation and has many limitations, much like the use of seeing-eye dogs. Special glasses on the market are not real-time: they describe the surroundings only when a button is pressed. There is still a need for real-time technology that can detect dangerous objects.

 

  • Proposed Methodology

 

Data Collection

The research will begin with a thorough data collection process customized to each environment’s specific needs. For the automotive domain, data will be collected from diverse sources, including Kaggle; datasets such as the Waymo Open Dataset and nuScenes offer abundant real-world driving scenarios. Preprocessing will include data augmentation to accommodate weather and lighting variations, as sketched below. Additionally, we plan to establish databases such as TokyoCrossroads-Static.
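For example, a weather- and lighting-oriented augmentation pipeline in torchvision might look as follows; the parameter values are illustrative assumptions to be tuned on validation data.

```python
import torchvision.transforms as T

# Illustrative augmentations for robustness to lighting and weather variation.
train_transforms = T.Compose([
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),        # viewpoint variation
    T.ColorJitter(brightness=0.4, contrast=0.4,
                  saturation=0.3, hue=0.05),            # lighting / color shifts
    T.RandomHorizontalFlip(),
    T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),    # rain- or fog-like blur
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],             # ImageNet statistics
                std=[0.229, 0.224, 0.225]),
])
```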

 

Model Development

The research primarily focuses on the creation of a Vision Transformer-based model capable of next-action prediction. The architecture will be tailored to the distinctive attributes of the automotive and assistive technology domains. Pretrained Vision Transformer models such as ViT and DeiT will be used as a foundational framework, followed by fine-tuning and customization for the specific tasks. The training process will utilize transfer learning to harness the capabilities of models pretrained at large scale, and will be conducted on high-performance GPU clusters.
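A minimal sketch of this starting point, using the timm library, is shown below; the model name and class count are illustrative assumptions.

```python
import timm

NUM_ACTION_CLASSES = 10   # placeholder; depends on the final label set

# Load a pretrained DeiT backbone and replace its head with a task-specific one.
backbone = timm.create_model("deit_base_patch16_224", pretrained=True)
backbone.reset_classifier(NUM_ACTION_CLASSES)

# Optionally freeze the backbone at first and train only the new head.
for name, p in backbone.named_parameters():
    if "head" not in name:
        p.requires_grad = False
```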

The research methodology will be carried out in a methodical manner, incorporating significant tasks that collectively contribute to the development and validation of our predictive model.

 

Data Annotation and Labeling

Careful data annotation and labeling will kick off the research for the automobile safety and assistive technology datasets. Annotators will label data with object and activity categories, boosting the model’s ability to understand context. Furthermore, to address the blind spot issue, the labeling tasks must record whether a person in the video looked at the car or not; an illustrative schema is sketched below.
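One possible shape for these labels, written as Python dataclasses, is as follows; the field names and action categories are our assumptions rather than a fixed standard.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PedestrianAnnotation:
    """One labeled pedestrian instance in a video frame (illustrative schema)."""
    frame_id: int
    bbox: Tuple[float, float, float, float]  # x, y, width, height in pixels
    action: str                              # e.g., "walking", "stopped", "peering"
    looked_at_car: bool                      # gaze label for the blind-spot task

@dataclass
class FrameAnnotation:
    video_id: str
    frame_id: int
    pedestrians: List[PedestrianAnnotation]
```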

 

Training and Fine-tuning of the Vision Transformer Model

After completing the necessary data preparation, we will proceed with training and fine-tuning our Vision Transformer-based model. Transfer learning techniques will be utilized, with pre-trained ViT and DeiT models serving as initializations that are then customized to the specific demands of the respective domains. The training process will involve multiple iterations to optimize the model’s predictive accuracy.
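A minimal fine-tuning loop, assuming a classification-style next-action label per sample, might look as follows; hyperparameters are placeholders.

```python
import torch
from torch.utils.data import DataLoader

def finetune(model, train_set, epochs=10, lr=1e-4, device="cuda"):
    """Bare-bones fine-tuning loop (no schedulers, logging, or validation)."""
    model.to(device).train()
    loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)
    # Only parameters left unfrozen (e.g., the new head) are updated.
    optimizer = torch.optim.AdamW(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```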

 

Integration of YOLO for Real-time Object Recognition

The system will incorporate the YOLO algorithm to facilitate real-time object recognition and detection. Modifications will be implemented to optimize compatibility with our Vision Transformer model. The YOLO algorithm will be instrumental in delivering real-time environmental data for automotive and assistive technology applications.
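One possible shape for this integration, sketched with the ultralytics YOLO package, is shown below; the specific weight file and camera source are assumptions.

```python
import cv2
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")   # small pretrained model; weight file is an example

cap = cv2.VideoCapture(0)       # camera stream for the real-time setting
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = detector(frame, verbose=False)[0]
    # Each box carries a class id, a confidence, and pixel coordinates; the
    # corresponding crops could then be handed to the ViT model for
    # behavior prediction.
    for box in results.boxes:
        cls_id = int(box.cls)
        conf = float(box.conf)
        x1, y1, x2, y2 = map(int, box.xyxy[0])
cap.release()
```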

Evaluation Criteria

The models’ performance will be evaluated using metrics designed for their respective applications. For automotive safety, the metrics include object detection accuracy and the precision and recall of action prediction.
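For the action-prediction component, these metrics can be computed with standard tooling; the labels below are dummy values for illustration.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 1, 1, 2, 0, 2]   # ground-truth action labels (placeholder)
y_pred = [0, 1, 2, 2, 0, 1]   # model predictions (placeholder)

print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred, average="macro"))
print("recall   :", recall_score(y_true, y_pred, average="macro"))
```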

Furthermore, the inclusion of user feedback and usability testing will play a crucial role in evaluating the impact of the technology on the visually impaired community, thereby ensuring that it adequately addresses their unique needs and expectations.

 

First, for camera-based dangerous object recognition, we will acquire a dangerous-object database in the outdoor environment and also use benchmarking data. The Fast R-CNN, SURF, and SIFT algorithms will be applied to detect objects; then MobileNet, Vision Transformer, AlexNet, ResNet, etc., will be used for object recognition. If time permits, we will create a prototype based on ESP chips or a Raspberry Pi. A brief sketch of the classical feature-detection stage appears below, and the overall system diagram for the project then follows.
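As a concrete example of the classical feature-detection stage, OpenCV’s SIFT can be exercised as follows (SIFT is shown because SURF ships only in the non-free opencv-contrib build; the input path is a placeholder).

```python
import cv2

# Illustrative classical detection stage: SIFT keypoints and descriptors.
image = cv2.imread("frame.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder input frame
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(image, None)
print(f"{len(keypoints)} keypoints, descriptor shape: {descriptors.shape}")
```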

 

 
