The model introduced in Section 2 presents a general framework for environment recognition, decision-making, and action execution in automation systems based on neuro-cognitive insights about the human brain. The first simulation and validation of this framework was presented in Section 3. In this simulation, the different modules were implemented in a rule-based form (hard-coded rules and fuzzy rules) in order to determine output data based on incoming data. In further development steps, the aim was to substitute these rules with approaches that are closer to the neurophysiological and neuropsychological information processing principles of the brain. The result of this research effort was the elaboration of the so-called neuro-symbolic information processing principle [3]. The first module to which this method was applied was the recognition module [29]. In later steps, it was also attempted to apply this mechanism to the action execution module and to the representation of emotions, drives, and desires. An overview of the neuro-symbolic principle is given in the following, with a focus on the recognition system and further remarks on the application to other areas.
4.1. Neuro-Symbolic Recognition
In Figure 3, an overview of the neuro-symbolic recognition model is given. Recognition, also referred to as perception, always starts with sensor values. These sensor data are processed in a neuro-symbolic network, which comprises the perceptual memory, and result in the perception of what is going on in the environment. The perception process is assisted by semantic memory and provides output information to the episodic memory and the decision-making modules. The neuro-symbolic network is the central element of the model and performs the so-called neuro-symbolic information processing. Due to length constraints of this paper, we will focus only on the description of this module.
The basic information processing units of the neuro-symbolic network are so-called neuro-symbols. The idea of using neuro-symbols as elementary information processing units came from the following observation: in the brain, information is processed by neurons. However, humans do not think in terms of firing nerve cells but in terms of symbols. In perception, these symbols are perceptual images like a face, a person, a melody, a voice, and so forth. Neural and symbolic information processing can be seen as information processing in the brain on two different levels of abstraction. Nevertheless, there seems to exist a correlation between these two levels. Indeed, neurons have been found in the brain which react, for instance, exclusively when a face is perceived in the environment [30–32]. This fact can be seen as evidence for such a correlation and was the motivation for using neuro-symbols as basic information processing units. Neuro-symbols show certain characteristics of neurons and others of symbols. Analyses of structures in the human mind have shown that certain characteristics and mechanisms are repeated on different levels, for example, afference and efference. This repetition of characteristics is a key element of the concept of neuro-symbolic processing.
In perception, neuro-symbols represent perceptual images—symbolic information—like persons, faces, voices, melodies, textures, odours, and so forth. Each neuro-symbol has an activation degree. This activation degree indicates whether the perceptual image it represents is currently present in the environment. Neuro-symbols have several inputs and one output. Via the inputs, information about the activation degree of other neuro-symbols is collected. These activation degrees are then summed up and result in the activation degree of the particular neuro-symbol. If this sum exceeds a certain threshold value, the neuro-symbol is activated and information about its own activation degree is transmitted via the output to other neuro-symbols. Neuro-symbols can process information that comes in concurrently, within a certain time window, or in a certain succession. Additionally, neuro-symbols can have so-called properties, which specify them in more detail. One important example of such a property is the location of the perceptual image in the environment.
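As an illustration of this mechanism, the following Python sketch models a neuro-symbol as a simple data structure with weighted inputs, an activation degree, a threshold, and optional properties. The class and attribute names are illustrative assumptions and do not stem from the original implementation; the temporal aspects (time windows, succession) are omitted for brevity.

```python
# Minimal sketch of a neuro-symbol (illustrative names, not the original implementation).

class NeuroSymbol:
    def __init__(self, name, threshold=0.5, properties=None):
        self.name = name                    # perceptual image the symbol represents
        self.threshold = threshold          # activation threshold
        self.inputs = []                    # (source neuro-symbol, weight) pairs
        self.activation = 0.0               # current activation degree
        self.properties = properties or {}  # e.g., location of the image in the environment

    def connect(self, source, weight=1.0):
        """Register an input connection from another neuro-symbol."""
        self.inputs.append((source, weight))

    def update(self):
        """Sum the weighted input activations; activate only if the sum exceeds the threshold."""
        total = sum(source.activation * weight for source, weight in self.inputs)
        self.activation = total if total >= self.threshold else 0.0
        return self.activation
```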
To perform complex tasks, neuro-symbols are combined and structured into neuro-symbolic networks. The structural organization of the perceptual system of the human brain as described by Luria [32] serves as the archetype for this neuro-symbolic architecture. According to Luria, the starting point for perception is the sensory receptors of the different modalities (visual, acoustic, somatosensory, gustatory, and olfactory perception). The information from these receptors is then processed in three hierarchical levels. In the first two levels, the information of each sensory modality is processed separately and in parallel. In the third one, the information of all sensory modalities is merged and results in a multimodal (modality-neutral) perception of the environment. In the first level, simple features are extracted from the incoming sensory data. In the first level of the visual system, for example, neurons respond to features like edges, lines, colours, and movements of a certain velocity in a certain direction. In the second level, a combination of extracted features results in a quite complex representation of all aspects of the particular perceptual modality. In the visual system, perceptual images like faces, persons, or other objects are perceived at this level. On the highest level, the perceptual aspects of all modalities are merged. An example would be to perceive the visual shape of a person, a voice, and a certain odour and to conclude that all this information belongs to a particular person currently talking.
In analogy to this modular hierarchical structure of the perceptual system of the human brain, neuro-symbols are structured into neuro-symbolic networks (see Figure 4). Here, too, sensor data are the starting point for perception. These input data are processed in different hierarchical levels into more and more complex neuro-symbolic information until they result in a multimodal perception of the environment. Neuro-symbols of different hierarchical levels are labelled differently according to their function. Neuro-symbols of the first level are called feature neuro-symbols, neuro-symbols of the next two layers are labelled subunimodal and unimodal neuro-symbols, and the neuro-symbols of the highest levels are referred to as multimodal neuro-symbols and scenario neuro-symbols. Neuro-symbols of one level form the symbol alphabet for the next higher level. Each neuro-symbol of the higher level is activated by a certain combination of neuro-symbols of the level below. Concerning the sensor modalities, sensors can be used that have an analogy in human sensory perception, such as video cameras for visual perception, microphones for acoustic perception, tactile sensors for tactile perception, and chemical sensors for olfactory perception. Furthermore, sensors can be used that have no analogy in the human senses, such as sensors for electricity or magnetism. Which sensor data trigger which neuro-symbols, and which lower-level neuro-symbols activate which neuro-symbols of the next higher level, is defined by the connections between them. There exist forward connections as well as feedback connections. These connections are not fixed structures but can be learned from examples [15]. Learning allows great flexibility and adaptation of the system, because learning is a process that involves all levels of the network. In the current approach, learning is intended to modify the connections between neuro-symbols, but future approaches will also change the structure of the network itself, thus allowing increased flexibility and the creation of new neuro-symbols.
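To make the hierarchy concrete, the following sketch (building on the NeuroSymbol class above) arranges neuro-symbols into the five named levels and propagates activations bottom-up, level by level. The level names follow the text, but the construction API and the way sensor data are injected are assumptions for illustration only; feedback connections and learning are omitted.

```python
# Sketch of a neuro-symbolic network updated bottom-up, level by level
# (forward connections only; feedback connections and learning are omitted).

class NeuroSymbolicNetwork:
    LEVELS = ["feature", "subunimodal", "unimodal", "multimodal", "scenario"]

    def __init__(self):
        self.levels = {level: [] for level in self.LEVELS}

    def add(self, level, symbol):
        self.levels[level].append(symbol)
        return symbol

    def perceive(self, sensor_activations):
        """Drive the feature level with sensor data, then propagate activations upwards."""
        for feature_symbol, value in sensor_activations.items():
            feature_symbol.activation = value   # feature neuro-symbols driven by sensor data
        for level in self.LEVELS[1:]:           # higher levels updated level by level
            for symbol in self.levels[level]:
                symbol.update()
        return [s for s in self.levels["scenario"] if s.activation > 0]
```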
4.2. Neuro-Symbolic Implementation and Use Case Description
To verify the concepts of neuro-symbolic recognition, the approach was applied to a building automation environment. Concretely, the test environment was the office kitchen of the Institute of Computer Technology (ICT) at the Vienna University of Technology [33, 34]. The kitchen comprises a table with eight chairs and a kitchen cabinet including a stove, a sink, a dishwasher, and a coffee machine. For testing the recognition model, the kitchen was equipped with sensors of different types: tactile floor sensors, motion detectors, door contact sensors, window contacts, light barriers, temperature sensors, a humidity sensor, brightness sensors, a microphone, and a camera. From these sensor data, different scenarios had to be perceived following the information processing principles proposed in Section 4.1. Since these measures turned the kitchen into an "intelligent" system capable of autonomously perceiving what is going on in it, it was named the Smart Kitchen.
In Figure 5, the neuro-symbol hierarchy for the detection of the three most typical events occurring in the kitchen during working hours is presented: "prepare coffee", "kitchen party", and "meeting". It is shown how, level by level, more and more meaningful and interpretable neuro-symbols are generated from partly redundant sensor data until they result in an activation of the neuro-symbols "prepare coffee", "kitchen party", and "meeting". The redundancy in the sensor data allows a certain level of fault tolerance in detection. An activation of a neuro-symbol of the highest level indicates that the event it represents has been perceived in the kitchen.
The event "prepare coffee" is the situation occurring most often in the kitchen and represents the activity that one or more of the employees come(s) into the kitchen, operate(s) the coffee machine, and leave(s) the kitchen again. The detection of this scenario is based on data from the video camera, the microphone, the tactile floor sensors, and the motion detectors. From the floor sensors and motion detectors, it is perceived where in the room a dynamic (moving) object is present. Together with an image processing algorithm analyzing the video data, it is concluded where in the room a person is present. The information from these sensors is partly redundant, which makes the perception more robust. In case a person is perceived close to the coffee machine and the acoustic noise emitted by the coffee machine is detected, the neuro-symbol "prepare coffee" is activated.
The "kitchen party" scenario generically describes a get-together of a number of people in the kitchen for an informal gathering, usually accompanied by food and drinks. Such informal gatherings benefit social networking and the quick exchange of ideas. This scenario is detected from the same sensor types like the "prepare coffee" event. However, in this case, there have to be detected two or more persons based on video data and data from the tactile floor sensors and motion detectors. Additionally, food and drinks on the table have to be identified from the video data and voices from the microphone.
The "meeting" scenario describes a formal get-together for working purposes. It is usually characterized by a number of people that are seated regularly around the table. They have papers or laptops to read and tools to write with them. The number of people talking at the same time is smaller and the overall noise level is lower than in the kitchen party scenario.
The information about perceived scenarios from the recognition module is constantly passed to the decision units. Depending on which event occurs, there are different requirements concerning lighting and heating or cooling. Based on the perceived event and additional sensor information about the current temperature, brightness level, position of the sunblinds, and window status (open/closed), a decision is taken on how to regulate heating, air conditioning, lighting, the position of the sunblinds, and so forth. For the "prepare coffee" scenario, for instance, standard lighting conditions are provided (main light switched on) in case the outside light is not sufficient. No special adaptations are made in heating or cooling, as the person(s) are present in the room only for a few minutes, which is below the time constant of the heating and air conditioning system. The "kitchen party" event likewise does not require particular adjustments in lighting. However, while the "prepare coffee" scenario is a spontaneous event, the "kitchen party" can be scheduled in advance, since the facility management has access to the room schedule. This is important because the cooling or heating load is considerable and requires preparation of the room climate. Such a scenario generally lasts about 30 minutes; the impact of the (human) heat load depends, amongst other factors, on the current inside and outside temperature. In the "meeting" scenario, lighting needs special adaptation. In case the outside light is not sufficient, a light above the table is switched on in addition to the main light. If laptops are used and direct sunlight shines on the screens, the sunblinds are shut. Adaptations in heating or air conditioning are made in a similar way as for the "kitchen party" scenario.
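The scenario-dependent lighting rules described above can be summarized in a short sketch. The brightness threshold, function name, and action strings are hypothetical; they only mirror the examples given in the text, not the actual decision module.

```python
# Hedged sketch of the lighting decisions described in the text
# (hypothetical threshold and action names).

def lighting_actions(scenario, outside_lux, direct_sun_on_screens=False):
    actions = []
    if outside_lux < 300:                                   # assumed "outside light not sufficient" threshold
        actions.append("switch on main light")              # standard lighting for all scenarios
        if scenario == "meeting":
            actions.append("switch on light above table")   # extra light for meetings
    if scenario == "meeting" and direct_sun_on_screens:
        actions.append("shut sunblinds")                    # avoid glare on laptop screens
    return actions

print(lighting_actions("prepare coffee", outside_lux=150))
print(lighting_actions("meeting", outside_lux=150, direct_sun_on_screens=True))
```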
The Smart Kitchen is a good example of the complex interactions between different subsystems that operate in a building or room, respectively. To achieve maximum energy efficiency, the system needs to know about room occupancy. Lighting conditions have to be adapted by electric light and sunblinds depending on outside light conditions and on the activity of the user, for example, when operating the coffee machine, reading journals that are on display in the kitchen, holding a meeting, or coming together for an informal break. The room climate has to be maintained, but only upon occupancy. Since the climate has much longer reaction times than, for example, lighting, the system has to either predict usage [35] or keep the climate permanently at comfort level, which is not energy efficient. Instead, the system has to operate the room in comfort mode (if it is occupied) or in pre-comfort mode (if unoccupied). In pre-comfort mode, the room can be operated under more relaxed conditions regarding temperature and humidity. This degree of freedom again allows for flexibility in the usage of renewable energy sources and for cost optimization (e.g., by cooling the room in summer at times when energy from the grid is cheap or when renewable energy is available). Lighting conditions are extremely critical, since human users react sensitively to changes, so the number of changes has to be kept to a minimum. Furthermore, there is no common lighting level for a room; it strongly depends on the geometry and obstacles in the room as well as on the lighting installation in the room. To maintain a high level of comfort while at the same time optimizing for all other goals (energy efficiency, costs, usage of renewables) is a most challenging task that can be approached satisfactorily with the presented model.
4.3. Further Neuro-Symbolic Representations
Similar to the recognition module of the model depicted in Figure 1, the neuro-symbolic information representation and information processing principle can also be applied to the action execution unit for the representation of procedural memory. As described by Goldstein [30], like the perceptual system, the motor cortex, which is responsible for action planning and action execution, is organized in a modular hierarchical manner. In contrast to the recognition unit, in the action execution unit the information flow is directed top-down from higher to lower levels. Unlike in the recognition unit, where neuro-symbols receive information from various sources and are only activated if their activation degree exceeds a certain threshold, motor neuro-symbols work the other way around. Their task is to distribute information about a planned action to various targets, and they therefore activate various neuro-symbols of the next lower level. At the highest level, neuro-symbols represent whole action plans as a reaction to a certain situation. Based on this, at the level below, neuro-symbols representing different subtasks of this action plan are activated in a certain sequence. From layer to layer, these action commands become more and more detailed until the last layer comprises neuro-symbols that directly result in the activation of certain muscles and muscle groups in a certain sequence. In technical systems, these muscle activations can be substituted by the activation of certain actuators or the triggering of alerts. Again, neuro-symbols of a lower level form the symbol alphabet of the level above and therefore allow a flexible reuse of defined structures.
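The top-down distribution of a planned action can be sketched in the same spirit: a high-level motor neuro-symbol activates its subtask symbols in sequence until actuator-level commands are reached. The class, the example plan, and the actuator commands below are illustrative assumptions, not part of the described implementation.

```python
# Sketch of top-down propagation in a motor neuro-symbol hierarchy
# (illustrative plan and actuator commands).

class MotorNeuroSymbol:
    def __init__(self, name, actuator_command=None):
        self.name = name
        self.subtasks = []                     # ordered subtasks on the next lower level
        self.actuator_command = actuator_command

    def add_subtask(self, child):
        self.subtasks.append(child)
        return child

    def execute(self):
        """Issue this symbol's actuator command (if any), then activate its subtasks in sequence."""
        if self.actuator_command is not None:
            print(f"actuator: {self.actuator_command}")
        for subtask in self.subtasks:
            subtask.execute()

# Example: a high-level action plan broken down into actuator-level commands.
ventilate = MotorNeuroSymbol("ventilate kitchen")
ventilate.add_subtask(MotorNeuroSymbol("open window", actuator_command="window motor open"))
ventilate.add_subtask(MotorNeuroSymbol("start fan", actuator_command="fan relay on"))
ventilate.execute()
```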
Besides recognition and action execution, neuro-symbols can also serve for the representation of emotions as used in the pre-decision and the decision module of Figure 1. In this case, neuro-symbols represent emotional states like lust, anger, panic, fear, hope, pride, and so forth. The activation of these neuro-symbols is triggered by sensory receptors perceiving the internal states of the body, by neuro-symbols of the recognition unit, or by higher cognitive activities. Further details concerning the representation of emotions via neuro-symbols and the structure of such neuro-symbolic networks have already been discussed in [25].
A similar representation might also be conceivable for drives and desires. Apart from this, it would be interesting to explore in a next step the possibility of representing other types of memory (episodic memory, semantic memory, and working memory) with the neuro-symbolic coding scheme and to investigate how the interaction between all these different neuro-symbolic representations works in the process of decision making.