Sample Winning Phase I SBIR (NASA)

An example of a winning Phase I SBIR proposal. Intended to assist startups seeking to work on their first SBIR proposals

Written by Eric Adolphe
Updated over 9 months ago

February 13, 2024

Note: The information contained here is over 20 years old. The document is intended to aid startups seeking to do their first SBIR. We find that knowing what the end product looks like is helpful in getting underway.


2 Identification and Significance of the Innovation

Aerospace operations, whether air-breathing or space, are characterized by the collection of relatively massive amounts of information. Some of these data, such as the Aviation Safety Reporting System (ASRS), are specifically focused on human factors issues. Other information sources, such as flight data recorders and maintenance records, contain some valuable human factors information but also serve other significant purposes.

When a mishap occurs, it is necessary to identify its causal human factors components and trace them back to a root cause. This forms the basis for effective interventions to prevent a recurrence of the mishap. At present, this process involves trained analysts interacting with each data source individually and then synthesizing the results into a coherent causal picture. Because each source must be dealt with separately, the task can be daunting, and, as a result, some analyses are not as comprehensive as they could be. Typically, the analyses for major accidents address the full range of data, while those for less severe events address only a subset.

The same comprehensiveness applied to major accidents could be brought to bear on all mishaps if an easy-to-use data mining technique were available that could integrate information from a wide range of sources into one common analysis.

ACME Corporation; the StarFleet Academy, San Francisco, CA; and Mr. Wiley Coyote (President of Dynamite and Associates, Inc.) propose a data mining system designed to facilitate the interpretation of large volumes of data retrieved from distributed sources, providing a highly efficient method to extract the causal factors underlying aerospace mishaps attributable to human factors. Each source providing data on aerospace mishaps would be modeled as an information retrieval system. A uniform identification scheme - based on a uniform feature identifier (UFI) made up of descriptors from a controlled set - would be used to encode the ‘raw data’, thus ensuring comparability in data reports from different sources.

An independent retrieval system, defined on each of the distributed databases of UFIs, would be implemented for the sources. The collective results of these systems would be used to determine (in real-time) a norm for any given search. The project is innovative in its use of a collection of independent sources to define and use such a norm for assessing and interpreting information. This approach makes it possible to flag either high-profile or unusual patterns of activity. A subset of the most frequently occurring UFIs retrieved from a collection of sources may be interpreted as an event warranting further investigation. At the other extreme, a set of ‘unrepresentative’ UFIs would identify an event that also calls for further scrutiny.

2.1 Understanding of the Problem

In the early days of aviation, air travel was plagued with mishaps, as airplanes tended to crash as often as they flew. Today, commercial air travel has become a relatively routine enterprise that is monitored and controlled by a distributed network of communication, navigation, and surveillance systems. Although the Federal Aviation Administration (FAA) and NASA have undertaken an ambitious program to modernize and further automate many of the National Airspace System (NAS) subsystems, there is considerable research and development work to be done to improve the efficiency of extracting, from large quantities of disparate data, the human factors that can contribute to mishaps. Extracting human mishap data in order to eliminate a factor or sequence of events will further NASA's and the FAA's mission to increase the safety and capacity of the NAS.

Similarly, building and launching rockets is a complex and dangerous undertaking and will continue to be a risky enterprise for the foreseeable future. Increasingly, humans are interacting with multiple distributed computing systems to monitor and control launch and flight. In order to reduce the risks inherent in space flight, it is essential to provide tools that enable efficient visualization of all human causal factors that could directly contribute to accidents. Such a tool will eventually assist in preventing these mishaps and support real-time identification of accident precursors through data mining.

The mined data could be displayed graphically and could conceivably be used in automated identification of trends that will enhance supervisory monitoring activities. Determining the feasibility of such a data mining technique and its design and ultimate implementation are therefore the foci of this proposal.

Perhaps in the near future, “fault trees” could be automatically created as graphical representations of the mined data. The “fault trees” could conceivably display every sequence of events that may have led to a particular mishap. The envisioned tool, “Anthony-Pro”, named after the patron saint of lost items, could then begin to automatically diagram all potential chains of causation, or eliminate branches, as the retrieval tool filters the data in real time, determines relevance, and highlights unusual activities that warrant attention or investigation.

One potential source of data for the envisioned tool is the ASRS and its foreign analogs. The proposed tool could mine, in real time, data from the ASRS and other relevant databases and retrieve anomalies, exceptions, actions, and human activities, and could eventually drive an automated fault tree display system. This information could then be used to flag either high-profile or unusual patterns of activity.

2.2 The Opportunity

The viability of the opportunity for NASA depends both on the innovative technical approach and on the contractor capabilities. The next subsections provide an executive overview of these key indicators of viability.

2.2.1 Innovative Technical Approach

This proposed project aims to demonstrate the feasibility and utility of a data mining system designed to facilitate the interpretation of information obtained from distributed information sources. Each source providing data on aerospace mishaps would be modeled as an information retrieval system. This approach is effective whenever several systems are available that are used by different groups to retrieve information from databases defined in the same information universe.

For purposes of ensuring comparability and defining a shared information universe for investigative operations, a uniform feature identifier (UFI) will be used to filter the raw data. An independent retrieval system, defined on each of the distributed databases of UFIs, will be implemented for the sources. The collective results of these systems will be used to determine (in real-time) a norm for any given search. This norm will then be used to identify either high-profile or unusual activities that warrant further investigation.

Evaluation of retrieval system performance typically encompasses coverage, recall, precision, response time, user effort and form of output. In particular, quantitative measures of recall and precision have been used extensively to assess the way a retrieval system responds to a query. For purposes of identifying sets of potentially ‘interesting’ results, a different performance measure is needed.

The measure proposed here - a quantitative measure of search bias (or representativeness) - computes the deviation of search results from a norm, and thus provides the means to flag high-profile or unusual patterns among the items retrieved by a particular information system. The main quantitative performance measures applied to retrieval systems are recall and precision.

Both of these measures are defined for a particular query q. Recall is the proportion of the items in the system relevant to q that are retrieved in processing q. Precision is the proportion of the items retrieved that are relevant to q. These measures presuppose that some independent determination of relevance can be made, either by users or subject experts. Since no such determination is feasible for large, dynamic databases, an alternative is needed to compute relevance in real time. The alternative view adopted in this project is to interpret relevance as popularity or familiarity.
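These two measures are straightforward to compute from a retrieved set and an independently determined relevant set. A minimal sketch (function and variable names are hypothetical, not from the proposal):

```python
def recall_and_precision(retrieved, relevant):
    """Recall: fraction of relevant items that were retrieved.
    Precision: fraction of retrieved items that are relevant."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision

# Four items retrieved, three known to be relevant, two overlap:
r, p = recall_and_precision(["a", "b", "c", "d"], ["a", "b", "e"])
```

As the proposal notes, the hard part is not the arithmetic but obtaining the `relevant` set, which is why an alternative notion of relevance is needed.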

The popularity/familiarity approach views relevance of a database item to a query q in terms of the number of systems that retrieve that item for q. This differs from the conventional idea of relevance in that both high and low relevance values can be informative. High values indicate a generally shared opinion. Low values indicate an outlier opinion. Both extremes mark potentially useful points of departure. In any case, this approach uses a collection of systems to establish a kind of benchmark against which to compare a particular system.
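This popularity count amounts to tallying, per item, how many systems returned it for the query. A minimal sketch (function name hypothetical):

```python
from collections import Counter

def attributive_relevance(response_sets):
    """Given one response set (an iterable of item IDs) per retrieval
    system for the same query q, count, for each item, how many systems
    retrieved it. High counts mark shared opinion; low counts, outliers."""
    counts = Counter()
    for items in response_sets:
        counts.update(set(items))  # each system votes at most once per item
    return counts

# Three systems answer the same query; "a" is retrieved by all three:
votes = attributive_relevance([["a", "b"], ["a", "c"], ["a", "b", "b"]])
```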

Two related definitions of attributive relevance can be used. The first ignores the order in which items are presented by a retrieval system; the second takes account of order. In both cases, a collection of retrieval systems is used to establish a norm. The two formulations of attributive relevance give rise to two different measures of recall and precision, one ignoring order of presentation, the other taking it into account.

Retrieval effectiveness (as measured by recall) and retrieval efficiency (as measured by precision) are both defined for the processing of a particular query whereas bias or representativeness is defined for a set of queries, e.g., all possible queries on a given topic. Bias signifies undue emphasis. For a retrieval system this means selecting some items in the database too frequently, not frequently enough, or giving them higher ranking in presentation.

Clearly, "too frequently", "not frequently enough", and "higher ranking" are relative terms. Thus, to measure bias it is necessary to establish a norm of appropriate emphasis. A collection of retrieval systems is used to establish such a norm. Two different measures of bias are considered, one that ignores the order in which items are retrieved, and one that takes account of order.

Establishing a norm of appropriate emphasis for purposes of measuring bias calls for determining the items that are attributively relevant for a set of queries processed by a collection of retrieval systems. The procedure used relies on a set of queries selected at random as a representative sample of inquiries about a given subject area. These queries can be constructed from a set of empirically determined keywords, i.e., keywords that are likely to be used to obtain information on the given subject.

Unordered bias of a retrieval system A relative to a collection of systems (including A) is defined as the normalized Euclidean distance between a vector C representing the performance of the collection and a vector I representing the performance of A. The i-th component of C is the frequency of occurrence of the i-th item among the response sets generated by the retrieval systems in processing the queries in the set; the i-th component of I is the frequency of occurrence of the i-th item among the response sets generated by A alone in processing the queries.
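The proposal defines the vectors C and I but not the exact normalization; the sketch below scales each vector to relative frequencies before taking the Euclidean distance, which is one plausible reading (function names hypothetical):

```python
import math
from collections import Counter

def unordered_bias(system_responses, all_responses):
    """Unordered bias of one system A relative to the collection.

    system_responses: list of response sets (one per query) from A alone.
    all_responses:    list of response sets from every system in the
                      collection (A included), over the same query set.
    Returns the Euclidean distance between the two relative-frequency
    vectors. (The cited work's normalization may differ; relative
    frequencies are an assumption of this sketch.)
    """
    c = Counter()                      # collection frequency vector C
    for items in all_responses:
        c.update(set(items))
    i = Counter()                      # individual frequency vector I
    for items in system_responses:
        i.update(set(items))
    universe = set(c) | set(i)
    c_total = sum(c.values()) or 1
    i_total = sum(i.values()) or 1
    return math.sqrt(sum((c[x] / c_total - i[x] / i_total) ** 2
                         for x in universe))
```

A system whose retrieval frequencies match the collection's exactly has bias 0; over-emphasizing some items pushes the distance up.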

A bias measure that takes account of the order in which items are presented can be determined in a similar way. For each position in a response set, we compute bias by comparing the set of items retrieved by the individual system to that produced by the entire family of systems. The Euclidean norm of the vector so obtained is the ordered or positional bias of the retrieval system for the given topic.
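The per-position comparison can be operationalized in several ways; the single-query sketch below measures, at each rank, how often the family agrees with A's choice, then takes the Euclidean norm of the per-position deviations. The specific agreement formula is an assumption of this sketch, not taken from the cited prototype:

```python
import math
from collections import Counter

def positional_bias(system_ranking, family_rankings, depth=10):
    """Ordered (positional) bias of one system's ranking relative to a
    family of rankings for the same query, down to the given depth."""
    n = len(family_rankings)
    deviations = []
    for k in range(depth):
        # what the family put at rank k, across all its systems
        at_k = Counter(r[k] for r in family_rankings if len(r) > k)
        item = system_ranking[k] if len(system_ranking) > k else None
        # fraction of family systems agreeing with A's choice at rank k
        agreement = at_k[item] / n if item is not None and n else 0.0
        deviations.append(1.0 - agreement)
    return math.sqrt(sum(d * d for d in deviations))
```

A system that ranks items exactly as the family does scores 0; reordering the same items now registers as bias, which the unordered measure would miss.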

A prototype computer system using the bias metrics defined above has been developed by Dr. Bugs Bunny and Dr. Daffy Duck for a collection of data items being retrieved from the Internet environment. For this project, the system will be adapted to work with a collection of simulated information retrieval systems whose items are analogous to the URLs of the World Wide Web.

The system acts as a meta-search engine and automatically computes bias for a set of queries. Currently, the system can invoke fifteen commercially available search engines. A set of key words, obtained from search engine users or other sources, is first compiled. Queries consisting of subsets of the keywords are then plugged into the search engines and the response sets are captured and processed. These response sets contain the data needed for the bias computations.

Preliminary results using the prototype system demonstrate the effectiveness of bias measurement in assessing the representativeness of search results.

2.2.2 Contractor Qualifications

The validity of a system concept matters little if the contractor team is not capable of making the vision reality in partnership with the contracting agency. The ACME Team, comprised of ACME Corporation as prime contractor and the StarFleet Academy and Dynamite and Associates as subcontractors, has a proven record of Research and Development (R&D) efforts both as a team and individually.

For example, a prototype computer system using the techniques discussed in this proposal has been developed by StarFleet professors Bugs Bunny and Daffy Duck for a collection of data items being retrieved from the Internet environment. For this proposed project, the system will be adapted to work with a collection of simulated information retrieval systems whose items are analogous to the URLs of the World Wide Web. The system developed for the Internet environment has been made available on the City College server as a performance-evaluation tool for Web search engines.

ACME and Wiley Coyote (Dynamite and Associates) are currently working together on developing and testing a Global Positioning System (GPS) and wireless communication link based Pedestrian Alert System (PAS) under a Phase II SBIR to the Federal Highway Administration (FHWA). In 2002, the team completed successful development of a novel kinematic Differential GPS (DGPS) system for accurate, rapid crash scene measurement and documentation that automatically produces a CAD drawing and customizable report.

The system developed under Phase I and II SBIR grants from the National Highway Traffic Safety Administration (NHTSA) uses 802.11 wireless links for enhanced process integrity and efficiency. A patent has been awarded for the latter system and it has generated commercial sales to local police, while a patent for the PAS system is pending.

On these successful projects, ACME performed the prime contractor and technology R&D role. Mr. Coyote made invaluable contributions by applying his human factors and safety expertise to develop realistic system requirements, and then by designing and executing field testing and data analysis that exhibited high fidelity to real-world conditions. We propose to use this well-proven and complementary division of responsibilities in developing Anthony-Pro.

Individually, ACME and Mr. Coyote have significant experience in their respective fields with NASA and the Kennedy Space Center (KSC). ACME's President and CEO, Speedy Gonzalez, was the Principal Investigator (PI) on successful Phase I and Phase II SBIR efforts that developed the pioneering EPIC wireless inspection system for KSC and the space shuttle in the early 1990s. The system provided wireless dissemination of inspection forms, secure electronic validation of inspection completion, and remote access to technical documentation available on the NASA wired network.

NASA has given Mr. Gonzalez several awards for the system, including the Best SBIR Software Award for 2303. It contributed to Mr. Gonzalez's recognition as an accomplished inventor at the National Inventors Hall of Fame in 2304. EPIC, originally called the Quality Assurance Portable Data Collection System, is currently being used for pre-launch inspection of the workhorse Atlas V rocket.

Moreover, Mr. Gonzalez is currently the PI on a National Institutes of Health (NIH) SBIR to research and develop a system that will capture clinical trial data at the point of participation from multiple hospitals with multiple, disparate information systems in a standardized electronic form and transfer that data to the American College of Radiology Imaging Network (ACRIN) data systems. The proposed innovation, Peregrine Smart-CRF, will be “fungible” electronic ACRIN case report forms that can support mobile collection of Case Report Form (CRF) data from local data collection processes, locations and databases.

The EPIC system and the crash documentation system mentioned above demonstrate ACME's proven record of commercializing SBIR technology. ACME used the EPIC technology concept to create a family of wireless mobile applications for local government. These include RAPIDS, a housing and property inspection system that bedevils slumlords in the District of Columbia (DC); Florian, a fire code and arson inspection/investigation system; and EPIC's most successful spin-off, DATA, a wireless C4 system for first responders. DATA is currently deployed in all DC ambulances, all Shuttle Craft medical evacuation helicopters in the DC region (DC, Northern VA, Central MD), several counties spanning Virginia's Shenandoah Valley, and Arlington, VA, which is charged with responding to emergencies at the Pentagon such as the 9-11 terrorist attack.

Wiley Coyote of Dynamite and Associates also has extensive individual experience relevant to the development of Anthony-Pro for NASA. Much of Wiley Coyote's recent work has involved applied research and development in aerospace systems. His accomplishments in this field were recognized through his appointment and reappointment as a consultant to and member of the Aerospace Safety Advisory Panel (ASAP) of the National Aeronautics and Space Administration (NASA) and his election as its vice-chair and then its chair. He is experienced in applied computing and has therefore often been called upon to examine human-computer interaction (HCI) issues. Mr. Coyote has been on the program committee for the HCI-Aero international symposium since 1998.

Mr. Coyote has extensive experience with NASA and its human spaceflight activities as a result of 15 years of service on the ASAP. The ASAP was established in the aftermath of the 1967 Apollo fire as a senior advisory committee to NASA. The legislatively mandated panel is charged with reviewing safety and operations with respect to all of NASA's activities. It reports directly to the NASA Administrator and the Congress. In addition to organizing and directing the Panel's activities, Mr. Coyote was the human factors specialist among the ASAP's 9 regular members and cadre of consultants and headed its Kennedy Space Center (KSC)/launch processing team. His duties were focused on Space Shuttle and International Space Station (ISS) processing and safety, testing and start-up issues with the ISS, computer system safety, and analyzing the effects of budget cutbacks, staff downsizing, and the shift to a single space flight operations contractor. For his work at NASA, Mr. Coyote received several awards, including NASA's Distinguished Public Service Medal, the highest accolade given to civilians.

This confluence of relevant experience and innovative approach makes this proposal a genuine opportunity for NASA. The development of Anthony-Pro will address NASA's need to increase efficiency and reduce the cost of assuring spacecraft readiness, while improving the quality of testing, checkout, and verification. It can also be adapted to serve the dual goal of improving aviation safety and airspace capacity.

3 Technical Objectives

By definition, all mishaps involve human error since humans design, build and operate all aerospace systems. The critical issue for the analyst charged with analyzing a mishap and preventing its recurrence is to separate proximate causes from the root cause of the event. These concepts are often difficult to disentangle when human factors are prominent. Consider, for example, a pilot who activates the wrong cockpit switch thereby causing a mishap. The proximate cause is clearly the switch throw.

The next layer of causality could be pilot training, poorly defined procedures, a mislabeled switch, lack of experience, or a distraction. Even these, however, may not be the root cause, which may lie in areas such as poor management practices, e.g., placing too low a priority on safety, or an underfunded program.

An effective way to provide rapid insights for root cause analysis is to mine available mishap databases for events similar to the mishap being studied. This provides a quantitative measure of the event. It also yields a trend that indicates whether the frequency is increasing, decreasing, or remaining constant, or whether the event is a first occurrence. Another useful analytical component is to identify and examine immediate precursor events that may be causally related to the mishap. A third valuable source of information can come from self-report databases such as the Aviation Safety Reporting System (ASRS) and its foreign analogs.

Typically, however, these and other valuable types of information, e.g., flight data recorders and maintenance records, reside in separate databases. The overall technical objectives of this SBIR, therefore, are to:

1) identify the universe of relevant databases that can add value to a mishap analysis;

2) develop a scheme for uniform encoding of source information (i.e., Uniform Feature Identifiers);

3) develop a technique to mine those databases for relevant information, not only in quantitative data but also in narrative information;

4) develop a scheme for presenting bias analysis results to the user;

5) develop methods to overcome language and semantic distinctions that can mask data equivalencies;

6) specify a software system to accomplish the data mining;

and, if the SBIR proceeds through all phases:

7) determine the feasibility of creating a software system to implement the techniques;

8) identify the user population for the system and define their characteristics, information needs, and preferences; and

9) define a process for developing the HCI requirements for the defined user population, including data requests.

The results will be a well-vetted design and a detailed Phase II development and implementation plan. Together, these products will provide NASA the information it needs to determine the desirability and feasibility of the proposed Phase II system development. Finally, the demonstrated and expected system design characteristics, in conjunction with the requirements and mission scope, can answer the question of its potential utility to NASA and the commercial market. In addition to setting the stage for Phase II, Phase I will yield a working demonstration of Anthony-Pro.

4 Work Plan

The ACME Team has developed the Phase I work plan to achieve the technical objectives listed in the previous section. The work plan is designed to provide ARC with sufficient design and feasibility information to allow them to evaluate a Phase II effort. The work plan is designed to be effective and realistic and to minimize schedule and technical risks.

Figure 4.1 shows a 26-week (6-month) schedule for this work plan. Table 4.1 shows the hours for each person allocated to each task. All tasks will be performed at ACME Corporation's Silver Spring, MD headquarters, except for the portions of the tasks performed by Bugs Bunny, Daffy Duck, and Wiley Coyote as indicated by the hour allocation in Table 4.1. Their work will be performed at the City College of New York and at Dynamite's offices in San Francisco, CA, respectively. Each task is discussed in detail in the subsections that follow.

Figure 4.1 Phase I Work Tasks and Schedule

Table 4.1 Hours Allocated Per Task for Each Key Person

ACME proposes to hold a web-based video conference Kickoff Meeting using our WebIQ product with the COTR and other appropriate NASA personnel within the first two weeks of the contract start date. The meeting will review the work tasks and schedule and solicit NASA suggestions and direction. We propose to use WebIQ's video web conferencing service to conduct this on-line meeting. ACME has also participated in video conferences using NASA facilities at the Goddard Space Flight Center. ACME has used this low-cost service to collaborate with NASA SBIR clients in the past with good success.

4.1 Determine System Requirements

The ACME Team has divided the system requirements needed for Anthony-Pro into two types: mission and human factors requirements, and system technical requirements. Wiley Coyote will utilize his human factors expertise and familiarity with aerospace environments to develop the former. ACME and StarFleet will concentrate on the latter, since developing computer systems, data mining, and visualization is their strength. Inevitably there will be dependencies and overlap between these areas. However, our experience working together as a team will allow us to perform the Anthony-Pro requirements definition with seamless efficiency.

The ACME team appreciates that we will be designing this system for the benefit of ARC. The personnel of ARC understand the process, goals, and environment better than anyone so their participation in defining the Anthony-Pro requirements is indispensable. ACME intends to include the appropriate personnel from ARC in the requirement definition task and as much as possible in the entire Phase I effort. As needed and when NASA personnel are available, we would like to have web-based video conferences with them to exchange ideas and vet the requirements.

4.1.1 Mission and Human Factor Requirements

Much of our thinking on the functionality of Anthony-Pro is contained in Section 2.2.1; to conserve space and avoid redundancy, please refer to that section for that discussion. The remainder of this section focuses on environmental and human factor requirements determination.

The key to any successful human/computer system begins with an in-depth understanding of its mission and its user characteristics. The notion of a user-centered system is often bandied about but rarely fully achieved. We will therefore devote significant effort during Phase I to a detailed description of specific objectives for Anthony-Pro as well as a comprehensive description of its user population. This latter analysis will encompass the background, training, objectives, preferences, and desired end results for each group of prospective users.

The mission for the system must consider possible uses both within and outside NASA. Within NASA, there is a periodic need to conduct incident analyses for events that occur to NASA-owned or operated aircraft and related equipment. NASA is also preeminent in developing analytical tools for others in the aerospace business to use. ASRS is a prime example. We would therefore expect that organizations such as NTSB and the airlines would have a great interest in Anthony-Pro as an incident analysis tool. It is important to remember that the analysis of minor incidents as part of the elimination of root causes is one of the best means of risk management and accident prevention.

We do not envision the need for Anthony-Pro to operate in adverse environments. It is not, for example, intended to be used at a crash scene. Thus, the key operational parameters will revolve around the HCI. Both the interface and the dialogue will be critical aspects in determining the performance of Anthony-Pro. Therefore, a key part of our user analysis will be to determine what metaphors are well known to the user population so that we can be consistent with any prevailing stereotypes.

4.2 Establish Efficient Computational Scheme for Retrieval Bias

A key technical requirement of the project is to create an environment capable of supporting an analysis of the bias in sets of items obtained from multiple retrieval systems. As noted earlier, bias assessment is applicable whenever several systems are available that are used by different groups to retrieve information from databases defined in the same information universe. Search engines operate in the same information universe in the sense that all seek to index the pages on the World Wide Web and provide subsets of such pages in response to queries.

Other collections of information retrieval or database management systems also operate in common information universes. Sales data maintained by a company, for example, is typically processed by several different departments, such as marketing, accounting, product development, and strategic planning.

Ensuring comparability and defining a shared information universe for assessment of causal human factors underlying aerospace mishaps requires the use of a uniform feature identifier (UFI). A UFI is an n-tuple of descriptors identifying an elementary event. A collection of elementary events defines a complex event. The descriptors include the time of the event, its spatial coordinates, and other terms furnished by subject experts. A collection of UFIs can represent a dynamic pattern (fixed-space complex event) made up of elementary events occurring in a given region over a period of time, or a static pattern (fixed-time complex event) consisting of elementary events occurring simultaneously in several regions.

Descriptors will be drawn from a controlled set. Assuming uniform definitions and representations of observations, it becomes possible to treat the collection of items (elementary events) in the different databases as a unified whole or information universe. Then the items retrieved become comparable across information systems.
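As an illustration of how a controlled-vocabulary UFI might be encoded, consider the sketch below. The descriptor set, field names, and validation rule are invented for this example; the real controlled set would be furnished by subject experts as the proposal describes.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical controlled vocabulary of human factors descriptors.
CONTROLLED_DESCRIPTORS = {"switch-error", "fatigue", "procedure-gap",
                          "training", "distraction"}

@dataclass(frozen=True)
class UFI:
    """A uniform feature identifier: an n-tuple of descriptors
    identifying one elementary event."""
    timestamp: str                   # time of the event (ISO 8601 assumed)
    location: Tuple[float, float]    # spatial coordinates
    descriptors: Tuple[str, ...]     # drawn from the controlled set

    def __post_init__(self):
        unknown = set(self.descriptors) - CONTROLLED_DESCRIPTORS
        if unknown:
            raise ValueError(f"uncontrolled descriptors: {unknown}")
```

Because every field comes from a uniform scheme, two sources reporting the same elementary event produce equal UFIs, which is what makes items comparable across information systems.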

The quantitative measure of bias to be used in this project is a performance measure – it indicates the presence, but not the particular cause of bias. Its value lies in pointing to deviations from expected behavior, thus signaling commonalities as well as possible anomalies that warrant further investigation. This approach to data mining is a kind of ‘triangulation’ procedure that capitalizes on the existence of multiple databases reflecting different ‘angles of vision’. By comparing the information from different databases – filtered by means of UFIs – a better approximation to the true state of affairs can be made.

A key technical question to be answered is how precisely to design UFIs to meet the needs of the information universe of investigations into human factors underlying aerospace mishaps.

4.2.1 Set Up Development Environment for Computational System

The technical requirements call for the design of a distributed system. This is needed to accommodate data from multiple sources. The data elements collected from a set of NASA processing units would be stored in a database that is placed in a server pool in the principal investigators’ architecture. One processing unit in a server pool may be configured as a dedicated database server, or multiple processing elements in the pool can take part in the database services simultaneously with the communication services.

There will be a number of independent database systems - as many as the number of server pools - and these databases will be maintained in a distributed fashion as a whole. The distributed database deployment allows for enforcing local policies regarding the use of the data, i.e., a group of users that commonly shares data elements can have them placed at the server pool where they have local control. It is much more economical to partition the application and do the processing locally at each pool. It is also much easier to accommodate increasing amounts of data, as expansion can be achieved by adding processing and storage power to the server pools of the network.

The incoming sequence of NASA event records, each having several attributes to be recorded, can be expected to grow in an unbounded fashion. These records would be stored in a database in a server pool for the most recent time window only, as it is beyond the capacity of any database system to store and provide access to this sequence for an indefinite amount of time. The data in fact indicate the history of events, and the order of these events is important for the subsequent computations of the analysis. In this project, the data collection of an active window will be treated as a data chronicle. A chronicle is similar to a relation, except that it is a collection of data sequences rather than an unordered set of tuples.

A generic form of chronicle database consists of relations, chronicles, and persistent views. Relations are standard, as in any relational database. A chronicle can be represented by a clustered relation with an extra sequencing attribute. For the event data, values of the sequencing attribute are timestamps, each of which indicates the time the data is collected. Two types of update operations are made to the chronicle database:

  • Insertion of data tuples, with the sequence number of the inserted tuples being greater than any existing sequence number in the chronicle. This is the only permissible operation under normal conditions.

  • Merging of resent data tuples based on the appropriate event sequences, used when data are found to be missing due to a malfunction of the measurement equipment at one site or to communication errors between processing elements and processing units. All resent data can be merged into the operational database based on the event sequence.

Treating a relation as a chronicle reduces sorting overhead during query processing. For instance, a traffic link time analysis requires a time-ordered set of events reported by various processing units. An order-by clause used to establish order at query time may induce unacceptably large sorting overhead. Instead, sequencing is performed by each update transaction (insertion and merge operations) before any such query, imposing only a small overhead. As each processing unit will send data in time sequence, only a merge process is required to maintain a chronicle in the database.
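The two chronicle update operations just described - ordinary insertion, which must extend the sequence, and merging of resent tuples into their proper place - might be sketched as follows. This is an illustrative Python sketch; the class and method names are ours, not the proposed schema:

```python
import bisect

class Chronicle:
    """A relation clustered on a sequencing attribute (here, a timestamp)."""

    def __init__(self):
        self.rows = []  # list of (timestamp, payload), kept in order

    def insert(self, ts, payload):
        # Normal operation: a new tuple's sequence number must exceed
        # every existing sequence number in the chronicle.
        if self.rows and ts <= self.rows[-1][0]:
            raise ValueError("insertion must have a greater sequence number")
        self.rows.append((ts, payload))

    def merge(self, ts, payload):
        # Resent tuple (lost earlier to an equipment or communication
        # failure): splice it in at its place in the event sequence.
        bisect.insort(self.rows, (ts, payload))
```

Because each update keeps `rows` ordered, a time-ordered query needs no run-time sort.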

Distributed database design involves deciding on the optimal placement of data and programs across the sites of a computer network. The problem to address is realizing efficient real-time distributed database processing, with primary emphasis on the temporal aspect of the data, so that the DBMS in each server pool can be managed while keeping the distribution transparent to the users.

4.2.2 Test Computation in Main Computer Database Environment

A uniform and consistent environment is needed for purposes of testing the computational capabilities of the system. This requirement is especially important in light of the different ways in which bias may be introduced into source information.

The selection and presentation of items retrieved in response to queries can be influenced by anything from the choice of items to be included in a database to the retrieval algorithm used.

The rules for selecting items for a database may be a source of bias in a retrieval system. Assuming that items are drawn from a universe of possible items, systematic inclusion of certain items and the exclusion of others can slant the information obtained from the system. This source of bias can be deliberate or inadvertent. Bias may also be propagated by the way items are indexed. If, for example, index terms are drawn from a set provided by an interested party, the terms may not accurately reflect the content of the database.

The dynamic character of many information systems makes the selection and indexing issues especially important in the treatment of bias. New items are continually being added, existing ones are removed, and many change their content over time.

Constraints or predispositions in the formulation of queries are another potential source of bias. The choice of terms, and simple transformations such as changing the order of terms in a query, may affect the set of items retrieved. Effects linked to query formulation may bias outcomes if users are predisposed to use certain terms or orderings of terms rather than others. Limited knowledge of the query language may reinforce such predispositions.

The search algorithm used in a retrieval system may also bias results. For example, syntactic or semantic processing of queries or the assignment of weights to terms in the query may lead to disproportionate emphasis on some items to the exclusion of others that may also be relevant and worthy of inclusion.

4.3 Build Simulation Model

Tests will be run to ensure correctness of the programs comprising the system. In addition, statistical analysis will be performed to establish confidence levels for the results of the bias computations. These tests will include examination of the variance in 1) topics of inquiry, 2) keywords, and 3) information sources.

Assessing the performance and validity of the proposed system using actual databases is the overall goal of the project. To facilitate comprehensive experiments, we propose to adopt a set of simulation models using a commercial database (Main Computer DBMS) running at our local site. For experimentation and analysis, the database will be populated with hypothetical information generated by stochastic models. Reproducing a portion of the (unclassified) data from the NASA archives would greatly facilitate experimentation, especially in the later stages of this project.

Figure 1. System Overview

The analysis engine will be built in a database independent way. Data formats used in the simulation need to be general and adequate to characterize a real environment. As explained above, a UFI is an n-tuple value that consists of a timestamp indicating when the data was collected, coordinates of the region of surveillance, and one or more descriptors delimiting the meaning of an elementary event. Descriptors are sets of nested attributes that express heterogeneous properties (e.g., color, space, quantity, certainty, etc.) of an event. Various n-tuple values will be generated in a transactional way to simulate periodic surveillance processes, and the computation will be triggered by the arrival of the new transactions.
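As a rough illustration of the UFI structure just described - a timestamp, region coordinates, and a set of descriptors - consider the following sketch. The field names and types are our assumptions for illustration; the actual controlled vocabulary would be fixed during Phase I:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UFI:
    """Uniform Feature Identifier: an n-tuple encoding one elementary event."""
    timestamp: float                      # when the data was collected
    region: tuple                         # coordinates of the surveillance region
    descriptors: frozenset = frozenset()  # attributes (color, space, quantity, ...)
```

Making the record immutable and hashable lets UFIs be collected into sets and compared across sources directly.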

A graphic interface will be designed to facilitate experimentation. It will allow for showing the successive changes in the simulated event and the result of an experiment. It is assumed that the data used in UFIs corresponding to elementary events are obtained from instruments, personal observation, etc. Figure 1 presents an overview of the system.

Suppose, for example, that data on an aerospace mishap is available from flight data and cockpit voice recorders, air traffic control radar and communication tapes, maintenance records, dispatcher records, amateur videos, still photographs, and observations of eyewitnesses. Each of these data sets could be regarded as a source with its information encoded as a collection of UFIs. In addition, each source would have an associated retrieval system.

The UFIs would provide information about the time, place and nature of an observation. Several observers in different locations, for instance, may have reported witnessing the identical scene at the same time. Simultaneous reports of the same event might lend it credence. The bias analysis would identify such a commonality. On the other hand, one source (say a video tape) might indicate an occurrence not picked up by any other source.

Again the bias analysis would identify this as an ‘outlier’ event, thus flagging a potentially important occurrence deserving of careful investigation. In effect, the proposed system would act as an auditor by flagging an event, either because it is picked up by several sources or only noticed by one. The event would then be brought to the attention of experts for further investigation.

Three types of complex events can be distinguished:

  1. fixed space, providing a dynamic picture of the scene at a given location over time;

  2. fixed time, providing a picture of activities occurring in several locations at the same time;

  3. fixed descriptor, providing a particular view of several locations over a period of time.
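Assuming a UFI record with timestamp, region, and descriptor fields, the three complex event types above amount to three filters over a collection of UFIs. The sketch below is illustrative; the record layout and function names are ours:

```python
from collections import namedtuple

# Minimal stand-in for a UFI record (field names are illustrative).
UFI = namedtuple("UFI", ["timestamp", "region", "descriptors"])

def fixed_space(ufis, region):
    """Dynamic picture of one location over time."""
    return sorted((u for u in ufis if u.region == region),
                  key=lambda u: u.timestamp)

def fixed_time(ufis, timestamp):
    """Activities occurring in several locations at the same time."""
    return [u for u in ufis if u.timestamp == timestamp]

def fixed_descriptor(ufis, descriptor):
    """A particular view of several locations over a period of time."""
    return [u for u in ufis if descriptor in u.descriptors]
```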

Performance auditing has been used successfully in a number of high risk areas. Banks take this approach in auditing the performance of foreign currency traders; brokerage firms do the same for retail account executives; and the securities industry monitors trading patterns to flag possible instances of insider trading. The bias assessment approach described here would do the same for human factors in aerospace mishaps.

The key questions to be answered by the simulation of event auditing are:

  1. how effective is the bias measure in identifying high profile events;

  2. how effective is the bias measure in identifying unusual events;

  3. what modifications in the UFI design could be made to improve the results of the bias analysis.

4.3.1 Define Domains of Analysis

The sources of aerospace information on human factors to be included in the project will be identified in this task. Systems or collections of data will be selected according to their significance, reliability, ease of modeling as retrieval systems, and ‘translatability’. It is reasonable to suppose that some data collections will lend themselves more easily than others to the application of uniform feature identifiers, so for purposes of demonstrating feasibility judicious choices must be made. In addition, this task will specify the experiments to be conducted with the simulation system. The experiments will be designed to show the feasibility and utility of using bias to discover ‘conventional wisdom’ at one end of the spectrum and ‘unconventional opinion’ at the other end.

4.3.2 Build Stochastic Models for the Event Simulator

An effective demonstration of the data mining approach taken in this project calls for the use of realistic stochastic models for the event simulator. Thus the data sources selected in task 4.3.1 will be sampled for the purpose of determining probability distributions over items. This analysis will serve as a guide to the selection of stochastic models for the simulator. If it is found, for example, that the empirical data tends to be uniformly distributed, the simulator will assume that items in a given data source are uniformly distributed.
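The two steps above - fitting an empirical distribution over items from sampled source data, then drawing simulated items from it - can be sketched as follows. Function names are ours; the real simulator would substitute whatever distribution family the sampling in task 4.3.1 supports:

```python
import random
from collections import Counter

def fit_empirical(sample):
    """Estimate a probability distribution over items from sampled source data."""
    counts = Counter(sample)
    total = sum(counts.values())
    return {item: c / total for item, c in counts.items()}

def simulate(dist, n, seed=0):
    """Draw n simulated items from the fitted distribution."""
    rng = random.Random(seed)
    items, weights = zip(*dist.items())
    return rng.choices(items, weights=weights, k=n)
```

If the empirical data turned out to be uniform, `fit_empirical` would simply return near-equal weights, matching the uniform case discussed above.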

4.4 Implement Simulation Model

The main system development objective during the concept study phase is to configure and customize an optimal on-line data transmission and processing software environment on top of a commercial database system installed at StarFleet. We will perform an extensive study of the feasibility of its on-line processing, to ensure smooth data transmission over the distributed database servers, and of its functional ability to compute bias measures for gathered UFI data segments.

Deploying the commercial Main Computer database for this study at StarFleet has several benefits, such as data administration functionalities and system programming capabilities. These are deemed necessary to develop a series of binary data transformations, structured in an object-relational model, in which one relation serves as a logical index characterizing UFI data properties and another relation holds tuples of raw metric data as binary large objects in the database. Graphical user interfaces, as detailed below, will be built for data analysis and status reporting once the database environment is established.

4.4.1 Design Graphical User Interface for Tracking Simulation Results

Two types of graphical user interfaces are planned: (1) Web-based distributed data access capabilities on top of Main Computer WebDB (features supporting database access through HTML and XML forms), which facilitate retrieval of the data transmission status through Web browsers. Distributed computation of the bias measures mandates relatively frequent data exchange among geographically remote data servers.

The planned Web-based data access user interface will serve as a means of verifying such data exchange; (2) a set of Java graphics programs that extract bias values and graphically show drill-down and roll-up results on the screen. The purpose of this development is to support various decision support operations based on the bias measures. The interface will allow observers to select the necessary portions of UFI attributes and to add selection conditions on their values in real time.

Phase 1 will define the most suitable graphic presentation for the computation results, depending on the focus of NASA data analysis. For instance, bias computation can be applied to a geographic image by populating the coordinates of edges from the image into a spatial database index. The Main Computer spatial index has a set of geometry functions, such as computing the distance between two geometry objects, computing the area and centroid of a polygon, and generating a polygon representing the difference between two geometries. Thus, computing and graphically presenting the area of emphasis will be done by projecting and combining the geographic coordinates of the obtained bias values.
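As an illustration of the kind of geometry functions mentioned above, the area and centroid of a simple polygon can be computed with the shoelace formula. This is a generic sketch, not the vendor's implementation:

```python
def polygon_area_centroid(points):
    """Signed area and centroid of a simple polygon via the shoelace formula.

    `points` is a list of (x, y) vertices in order around the polygon.
    """
    a = cx = cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        a += cross                   # accumulates twice the signed area
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    a *= 0.5
    return a, (cx / (6 * a), cy / (6 * a))
```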

4.4.2 Define Formats for Presentation of Simulation Results

The format for the user interfaces and presentation will be defined in cooperation with appropriate NASA researchers and engineers. Format definition will proceed by identifying the required decision-making processes and typical input and output formats for data analysis. Building on the formats defined in this project, it should be possible to construct a macro-based user interface description language that would allow system observers and analysts to customize the input method, interface, and resulting reports for their specific needs.

4.4.3 Code Simulation System

Programming work will be conducted incrementally: 1) establishment of a secure, distributed processing environment for three Sun workstations and a PC installed at StarFleet, with a version control system to manage releases and to control concurrent editing of program files. One Sun workstation will be customized as a network server running the Apache Web server, which will also connect to a dual-processor Windows server running another Main Computer instance to simulate a heterogeneous database environment. This Windows server will also generate simulated UFI data for the experiments.

For portability of the Phase 1 operating environment, most operating software will be installed from open source distributions such as those of the GNU organization and the W3 consortium. 2) design and development of a specialized database interface for accessing binary information, which enables direct population and retrieval of a dataset of extracted event data stored in the database. This interface will be built in the C++ programming language. The program will make use of a set of Main Computer Call Interfaces (OCI). The choice of OCI is dictated by the use of a public C++ compiler (GNU g++) rather than vendor-specific compilers. The interface can run as a Main Computer client.

This means that the developed interface can be placed on both PCs and workstations. 3) development of an interface program to report the status of distributed data transmissions between the Sun workstation running a Main Computer server and a PC running the Main Computer client. This will measure the overhead required to locate and transfer a portion of the UFI data over the Internet as well as an Ethernet LAN. 4) development of distributed bias computations and a Java-based presentation layer that has both drill-down and roll-up capabilities.

4.5 Perform Experiments with the Simulation

The experiments will be designed and conducted with a view to 1) demonstrating the feasibility of data mining in a distributed database environment using the Uniform Feature Identifier encoding scheme, and 2) showing the utility of bias analysis as a data mining tool.

4.5.1 Fixed Time Events

Fixed time events provide a picture of activities occurring in several locations at the same time. The system will be used to search for UFIs depicting such fixed time events. Search results will be analyzed for bias and both commonalities and outliers identified.

4.5.2 Fixed Space Events

Fixed space events provide a dynamic picture of a scene at a given location over time. As noted in 4.5.1, the system will be used to search for UFIs depicting, in this case, fixed space events. Search results will be analyzed for bias and both commonalities and outliers identified.

4.5.3 Fixed Descriptor Events

Fixed descriptor events provide a picture of several locations over a period of time. Once again the system will be used to search for UFIs depicting fixed descriptor events. This type of event is the most complex and several different descriptor events will be examined. In each case, search results pertaining to an event will be analyzed for bias and, as in 4.5.1 and 4.5.2, commonalities and outliers identified.

4.6 Feasibility Analysis

There are four components to ACME's feasibility analysis: expected system performance, cost, development risk (including the ability to conduct a valid and reliable test of the system), and user acceptability. We are confident that a working version of Anthony-Pro can be developed and used to demonstrate the proposed data mining technique.

We will investigate the feasibility of using a prototype computer system, developed by Dr. Bugs Bunny and Dr. Daffy Duck for retrieving data from the Internet environment, that employs the bias metrics defined above.

This innovation will significantly reduce the cost of investigating mishaps and, ultimately, of preventing them. Expected system performance will be assessed using simulations and by mining data from the ASRS and other sources. Estimation of the development risk is based on the maturity and availability of the technology, as well as the complexity of the system development. Judging the complexity of a development effort is more subjective, especially without a detailed design, but we will base this metric on the relative number of significant new, unproven functions that must be devised and implemented for each configuration, and on the technology readiness level of the components being used.

We will also include in the development risk the ability to conduct a realistic and meaningful test of a prototype of the system. Some promising designs may be extremely difficult or even impossible to verify adequately in Phase I. We are explicitly reducing the project's development risk by using prototype technologies developed by our colleagues.

User acceptability will be judged by how well the prototype can actually flag patterns as well as anomalous information and how it meets requirements determined from user needs. Each of the metrics will provide input into the overall feasibility score. The system design configuration that is determined to be the most feasible will be further developed, and form the basis of the phase II proposal. In case of equivalent scores, ACME will both solicit the COTR's input and consider our team's strongest capabilities in choosing the design configuration to recommend for Phase II development.

4.7 Detailed Design of the Anthony-Pro System

The recommended system configuration will be expanded into a more detailed system design. This is required to accurately scope the development effort to be proposed for Phase II. This design will address both hardware and software required for the system.

Moreover, ACME uses a 3-tier software architecture for software development. A 3-tier approach helps to clearly delineate the boundaries between the presentation of information, the business rules that drive the application, and the data repository. Any database used by the application resides within tier 1, the data source/repository layer. Database design and development are driven by the requirements and use cases developed for the system. Data fields, tables, and relationships extracted from the use cases and requirements are used to build a data model and the database schema.

ACME has extensive experience interfacing and integrating with disparate databases and non-formalized data sources (e.g., data from sensors) through the use of proxy agents at the data source layer in a 3-tier architecture. If needed, these agents can be used in Phase II to interface with specific databases. Each agent knows how to input and output data for a specific database. It can accept data to be stored in a standard format such as XML and transform the data from XML into the format required for storage in the database. A proxy agent can also extract and transform data into a standard format such as XML for use by other systems.
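A proxy agent's two transformation paths - standard XML in, standard XML out - might be sketched as follows. This is illustrative only; the flat record format and function names are our assumptions:

```python
import xml.etree.ElementTree as ET

def xml_to_row(xml_text):
    """Input path: a standard XML record -> a database-specific row (dict)."""
    root = ET.fromstring(xml_text)
    return {child.tag: child.text for child in root}

def row_to_xml(row, tag="record"):
    """Output path: a database row -> standard XML for use by other systems."""
    root = ET.Element(tag)
    for key, value in row.items():
        ET.SubElement(root, key).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

In Phase II each agent would replace the dict with the storage format its target database actually requires.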

4.7.1 Design Review

The ACME team believes it will be desirable to brief the COTR and ARC staff on the best candidate design configurations, the feasibility analysis, and the recommended design. This will allow us to incorporate the Government's comments, concerns, and desires into the design and the development plan.

4.7.2 Phase II Development Plan

In this task, the ACME team will specify a plan for Phase II development. The plan will identify the technical objectives, work tasks, and schedule. Development risks will be identified and quantified. The result will be less detailed than the actual Phase II proposal but it will discuss project feasibility and key risks to completion, a result that is the primary goal of the Phase I efforts.

4.7.3 Final Report

A comprehensive final report will be prepared that will document the work performed, results obtained, and provide an estimate of the technical feasibility for completing Phase II. The report will present conclusions and recommendations for the Anthony-Pro system. A draft of the final report will be submitted to the COTR three weeks before the end of Phase I. The final version of the report will be delivered at the six-month mark.

5 Related R&D

ACME and Wiley Coyote (Dynamite and Associates) are currently working together on developing and testing a Global Positioning System (GPS) and wireless communication link based Pedestrian Alert System (PAS) under a Phase II SBIR to the Federal Highway Administration (FHWA). In 2002, the team completed successful development of a novel kinematic Differential GPS (DGPS) system for accurate, rapid crash scene measurement and documentation that automatically produces a CAD drawing and customizable report.

The system developed under Phase I and II SBIR grants from the National Highway Traffic Safety Administration (NHTSA) uses 802.11 wireless links for enhanced process integrity and efficiency. A patent has been awarded for the latter system and it has generated commercial sales to local police, while a patent for the PAS system is pending.

On these successful projects, ACME performed the prime contractor and technology R&D role. Mr. Coyote made invaluable contributions by applying his human factors and safety expertise to develop realistic system requirements, and then by designing and executing field testing and data analysis with high fidelity to real-world conditions. We propose to use this well-proven and complementary division of responsibilities in developing Anthony-Pro.

ACME and Wiley Coyote have worked together on several research and development projects, including four SBIRs and the kinematic DGPS crash scene measurement system described above. The team works well together and is further augmented by Dr. Bunny's and Dr. Duck's expertise in developing data mining tools.

A brief overview of ACME projects should provide an indication of our capabilities. For NASA KSC, ACME co-developed the EPIC Safety and Mission Assurance Inspection System. ACME also provided requirements analysis and helped to redefine KSC's business processes to take advantage of automation. In addition, ACME developed a sophisticated software application called AutoDOCS GPS that includes a custom multimedia user interface and proprietary GPS processing software consisting of a kinematic, Differential Global Positioning System (DGPS) Kalman filter. Developed under a National Highway Traffic Safety Administration (NHTSA) SBIR, AutoDOCS GPS has received a U.S. patent.

5.1 National Aeronautics and Space Administration, Safety and Mission Assurance Inspection System (EPIC)

5.2 Student Polarimeter Aerosol Cloud Experiment (SPACE) at StarFleet Academy

The Student Polarimeter Aerosol Cloud Experiment (SPACE) addresses the NASA Earth Science Enterprise (ESE) theme of climate variability and prediction by providing unique observations of the current composition, size and optical depth distribution of aerosols that will markedly improve our understanding of

7 Relationship with Phase II or Future R&D

The Phase I effort is a proof of concept as well as a definitization of the characteristics of the user population. It will also result in a proven set of system objectives to form the basis for detailed requirements development and systems design. Finally, Phase I will identify the relevant types of databases for the proof of concept effort focused on air breathing vehicle mishaps. These databases will be characterized and identified. The characterization will permit the ultimate extension of the approach into other domains by locating or initiating analogous collection and storage systems. The identification will provide the specific data, e.g., ASRS, that will be used in the prototype development effort of Phase II.

The Phase II effort will validate the concept with a working prototype. If the approach proves viable, the basis will have been established for a tool to improve the assessment of human factors issues in mishaps of all types. Since the initial focus will have been air breathing vehicles, it will likely lead first to a working system for that application. This might be used by organizations such as the National Transportation Safety Board (NTSB) and individual airlines as well as by NASA mishap investigation boards. The developed product would not be designed to supplant the expert analyst but rather to extend his/her reach and analytic capabilities. This would not only facilitate the depth of analysis typically already conducted on severe mishaps but also permit the same level of examination, in a cost-effective manner, for less severe occurrences.

8 Company Information and Facilities

All tasks will be performed at ACME Corporation's Silver Spring, MD headquarters, except for the portions of the tasks performed by Dr. Bunny and Dr. Daffy Duck. Their work will be at the City College of New York. The ACME facility is equipped with a Rapid Application Development (RAD) laboratory and extensive computing assets. The facility includes offices for technical and support personnel, two conference rooms, and a hardware test and assembly room that has test equipment for hardware development and troubleshooting. ACME also has software and equipment available for requirements/mission testing during this phase.

ACME provides a high-speed Internet connection for all technical employees. ACME is a Microsoft Certified Software Developer with access to all Microsoft beta software, technical support, and development environments. ACME uses Rational Rose for software design, documentation, project control, and testing. The office meets all state and federal environmental laws and regulations.

StarFleet will also provide a furnished 400 square foot development site for this project in North Academic Building room 7215 and adjoining office space. The facility is wired for Ethernet networking, and the switches and Windows Server will be provided for the project development period. It is equipped with three SUN Ultra 10 workstations configured as a network server, a database server, and a network file system server, and eight SUN Ultra Blade 100 workstations, each with a PC coprocessor running Windows XP. All these machines and operating software systems are customized for a secure, distributed processing environment. Access to the facility is restricted to the PIs and project students.

9 Subcontracts and Consultants

Wiley Coyote, President of Dynamite and Associates, and Dr. Bunny and Dr. Daffy Duck, will serve as consultants for this Phase I SBIR. Mr. Coyote will serve as the subject matter expert. Mr. Coyote’s efforts will include delineating realistic and effective system requirements, including applicable use scenarios, operational techniques, and human factors. Dr. Bunny and Dr. Duck will be instrumental in adapting their prototype computer system using the bias metrics defined above for this project. As required, the Consultants will perform a maximum of 33% of the total work.

10 Potential Applications

Anthony-Pro is applicable to any commercial or government operation with high value information spread out over disparate data sources. This is especially true where there is a potential for the loss of human life or high economic value. For example, a common theme expressed in all modes of transportation is the desire for more efficient information retrieval, better data filtering, the ability to automatically flag common or anomalous data, and to better visualize flagged information.

If such a system provided enhanced visualization of the information (for example, using fault trees) - with the ability to display every sequence of events that could lead to a mishap, automatically diagram all potential chains of causation or eliminate branches in real time, determine the relevance of certain data, and highlight the most important information warranting attention or investigation - it would be more desirable and have more potential applications.

In addition, Anthony-Pro will be designed to allow rapid customization to diverse business processes and optimized, efficient HCI. An incomplete list of potential uses includes aircraft manufacturing, municipal bus fleet maintenance, rail mishap investigation, military motor pool accident investigation, commercial fleet management (UPS, FedEx), maritime transportation and shipping management, real-time in-transit visibility of military materiel (by mining automatic identification technology (AIT) radio frequency identification sensors), railroads, airport surface management, and the mining of large quantities of data from automobile “black boxes” by insurance firms.

ACME has proven its ability to commercialize NASA technology for ambulance and helicopter EMS, property/housing inspection and enforcement, and a Problem Driver Detection System. We have recently announced a spin-off company, OptiStat Incorporated, in partnership with the University of Pittsburgh - renowned for its EMS science - and one of the largest helicopter MedEvac companies, to make our EMS software the Best Practices standard in pre-hospital care.

We also have a successfully commercialized product from an NHTSA SBIR that is being used by police to investigate automobile crash scenes. In Phase I of our FHWA pedestrian safety SBIR, we obtained support from Montgomery County, Maryland schools to help test and ultimately deploy the system. We also interested Jetson Motors in the project, and they have provided a new luxury car for our exclusive use in testing over the two years of Phase II. We hope to have the system integrated with every vehicle navigation system sold late in the decade.

11 Similar Proposals and Awards

The ACME team has no prior, current, or pending proposals or awards that involve substantially the same work as proposed here.
