Design documents and design project footprints accumulated by corporate information technology systems have increasingly become valuable sources of evidence for design information and knowledge management. Identification and extraction of such embedded information and knowledge into a clear and usable format will greatly accelerate continuous learning from past design efforts for competitive product innovation and efficient design process management in future design projects. Most of the existing design information extraction systems focus on either organizing design documents for efficient retrieval or extracting relevant product information for product optimization. Different from traditional systems, this paper proposes a methodology of learning and extracting useful knowledge from past design project documents from a design process perspective based on process mining techniques. Unlike conventional techniques that deal with timestamps or event logs only, a new process mining approach that is able to directly process textual data is proposed at the first stage of the proposed methodology. The outcome is a hierarchical process model that reveals the actual design process hidden behind a large number of design documents and enables the connection of various design information from different perspectives. At the second stage, the discovered process model is analyzed to extract multifaceted knowledge patterns by applying a number of statistical analysis methods. The outcomes include task dependency study through workflow analysis, identification of irregular task execution through performance analysis, cooperation pattern discovery through social network analysis, and evaluation of personal contribution through role analysis. Relying on the knowledge patterns extracted, lessons and best practices can be uncovered which offer great support to decision makers in managing any future design initiatives.
The proposed methodology was tested using an email dataset from a university-hosted multiyear multidisciplinary design project.
Introduction
In the information age today, the advancement and widespread application of information management systems [1,2] that use textual databases to organize information have been archiving vast amounts of digital design documents at various stages of product design projects. Examples include customer requirements, computer-aided design (CAD) models, emails, chat logs, design forums, test reports, customer reviews, and repair reports. As these archival documents have objectively recorded the execution of past design projects, they have become the essential source of design information and empirical knowledge to assist decision makers in better managing future design projects and their corresponding processes. It is, therefore, considered that mining empirical information from historical design documents and reutilizing them in practical design work is one of the most important factors to enable modern enterprises to gain sustainable competitive edge [3].
Most of the existing design information extraction systems have been focusing on extracting product-relevant information from design documents such as CAD models and sketches for product optimization. However, design documents such as emails, conference minutes, and conversation transcripts, which contain invaluable process-relevant information, are mostly underutilized in the context of design information extraction. Different from product-relevant information which focuses on product structures and product functions, process-relevant design information is more about “How a product is designed,” “who executes what tasks,” and “who often work together.” Based on such information patterns, successes or failures of past design projects can be learned and reutilized to support decision making at all stages of product development by means of suggesting promising problem solutions, evaluating possible alternatives, allocating the most suitable resources, and identifying bottlenecks for improvements. A detailed review of process-based design reuse was carried out by Baxter [4]. However, most of the reviewed methods heavily rely on human experience and judgement to construct a process model, which is often error-prone, time-consuming, or virtually impossible due to the length of the project and its breadth in terms of technical capacity and geographic coverage. To avoid or reduce the influence of human involvement, a promising opportunity relies on computational algorithms to automatically uncover critical process-relevant information, e.g., process models, from archived design project documents.
Process mining is a prevailing technique that looks inside the process by automatically extracting business workflow models from event logs recorded by business information systems [5]. Although process mining has proved to be an efficient tool that learns from the past for business process management, creating automatic approaches for mining design process models from archival design documents is still a major challenge for efficient process information reutilization in the context of product design. As most of the design data that record design process executions are semistructured or unstructured texts, traditional process mining approaches that depend on structured event logs are ill-suited to design process model discovery. Furthermore, as design processes are usually unpredictable and iterative [6], design task execution is rarely repeated in exactly the same form and manner. Therefore, traditional process mining approaches that attempt to capture process behavior in a flat and linear model might produce very large and complex models for design processes.
In our previous study, we proposed a layered text mining system which aims to discover process models from design documents recording past design processes [7]. As an extension, this paper presents a methodology for learning critical process-relevant design knowledge from the past via an enriched process mining approach. It is intended as a supporting tool that helps decision makers improve future projects based on the best practices learned from past design projects. In detail, the proposed methodology consists of two main stages: uncovering the design process model first and then enabling knowledge learning from the uncovered model. The first stage is designed to establish a hierarchical process model from the input design documents using an enriched process mining approach. Different from the existing process mining techniques which deal with event logs and timestamps, the inputs in our study are archived design documents written in a natural language (in this case, English). In addition, by modeling the documented design process in a hierarchical and modular manner, the proposed approach is able to reduce the model complexity that is caused by the flexible nature of the design process. The second stage focuses on analyzing the uncovered process model so as to enable design information and empirical knowledge learning from multiple perspectives, e.g., the actual execution trace/route via workflow analysis, bottlenecks via performance analysis, cooperation via social network analysis, and individual contributions via role analysis. Such studies would help designers make practical and efficient decisions in future design projects. The effectiveness of the proposed methodology is demonstrated with a case study taken from a real university-hosted design project.
The remainder of this paper is structured as follows: Section 2 reviews related works; Section 3 describes the first stage of the proposed methodology and the different methods incorporated within the proposed process mining approach; Section 4 presents a number of statistical analysis methods used in the second stage; Section 5 reports the real-life case study; Section 6 discusses possible extensions and future work; and Section 7 concludes.
Related Work
Design Information Extraction.
Design information extraction has its roots in linguistics and data mining, with a particular focus on extracting high-quality design information in the form of patterns and terms from design documents by means of natural language processing and data mining. Typical design information includes customer requirements, design rationales, technology trends, product structures, problem solutions, and resource allocation. Through design information extraction, information contained within design documents can be made more accessible to assist decision-makers in making more feasible and efficient decisions, which aim to optimize product design or speed up product development processes.
Due to the easy access of CAD documents, a significant body of the early research work on design information extraction has focused on retrieving and reusing past CAD models by extracting parametric associations. Some of them use color and texture as main features to retrieve similar CAD drawings from an image database [8,9], while others treat CAD models as structured text files and use text mining techniques to represent CAD models as vectors of identifiers [10,11]. Based on the retrieved CAD models, design information relating to product geometry can be reused to speed up product development process via reusing the associated manufacturing processes [12], thus reducing the time required to generate a process plan. However, these CAD-based information extraction and reutilization systems are only suitable for solving geometric problems, which relate more to product structure.
Besides the product geometry embedded in CAD models, there is a wealth of nongeometric design information embedded in other types of design documents such as patents [13], customer reviews [14], repair verbatim [15], production configuration data [3], and communication transcripts [16]. Two significant research streams are the discovery of technology trends from patents and the extraction of customer opinions from online reviews [13,17]. Both the technology trends and the customer opinions could inspire designers to search and identify solutions for product optimization. For example, in the ISAL (issue, solution, and artifact) system, a series of text mining algorithms were specifically designed to automatically discover design rationales from patent documents for an engineering design purpose. With a similar purpose of market-driven technology innovation, potential product concepts of solar-lighting devices were identified from a collection of domain-specific patents [18]. Another example is automatically translating customer reviews into engineering characteristics for quality function deployment [14].
The literature review on design information extraction indicates that most of the existing works have focused on information relating to the product, whereas little work has addressed extracting and reusing information relating to the design process. Design process models, as an integral part of design information, can be reused for decision making at various stages of design processes. Further efforts are needed to explore the potential of automatically extracting process-relevant information, such as process models, from historical design documents.
Process Mining.
Process mining, also known as event mining or workflow mining, is a general methodology used to diagnose business processes by discovering models (e.g., Petri net, business process model and notation, and event graph models) that describe reality from historical event data [19]. The business process models discovered can be compared with a priori models to check whether reality, as recorded in the event log, conforms to the business specifications [20], or be used for simulation and performance analysis [21]. Traditionally, process mining has been focusing on control-flow discovery; that is, automatically discovering the causal dependencies or execution patterns between activities from enactment logs [22–24]. In recent years, as techniques have matured, process mining has been applied successfully in a wide range of real cases, e.g., shipbuilding industry [25], risk management [26], financial service [27], and healthcare processes [28].
Although traditional process mining techniques are able to discover high-quality models from logs of well-structured processes, they usually return “spaghetti-like” models when applied to logs of unstructured processes [29]. The main reason is that traditional process mining approaches aim to unify all the behaviors recorded in event logs in a single flat model. This strategy is not suitable for unstructured processes, where process executions greatly differ from each other. As a remedy, Gunther and van der Aalst [29] proposed a fuzzy mining approach to simplify the discovered model with the concept of roadmap abstraction. Maggi et al. [30,31] used a semistructured process scheme referred to as “declarative workflow” to present unstructured processes with a set of constraints that state the rules among activities. Diamantini et al. [32] employed hierarchical graph clustering to identify subprocesses, which reflect meaningful collaboration work practices.
As traditional process mining approaches lack the ability to handle unstructured data, mining process models from natural language texts has been attracting more and more attention in recent years. Sinha and Paradkar [33] used use cases as source documents and presented a text analysis approach for semi-automatically transforming use cases into business process models. Friedrich et al. [34] combined existing tools from natural language processing and augmented them with an anaphora resolution mechanism to generate business process model and notation models from process descriptions. Most of these process mining approaches follow a similar mining scheme, which consists of three steps: syntactic analysis, which focuses on tokenization and part-of-speech tagging; semantic analysis, which detects actions and actors using semantic dictionaries and knowledge bases; and model generation, which discovers sequence flows through predefined signal words. One major limitation of these approaches is that the input text has to describe a model sequentially, and the statements in the descriptions must relate to the process model. Furthermore, creating such descriptions requires extra manual effort.
To summarize, because design process is often flexible, and the execution of a design process is usually recorded in a text-rich format, it has become imperative to research suitable techniques, such as process mining, for the discovery of underlying design process models. Furthermore, to address the difficulties currently faced, such techniques should not only be able to handle the textual data as the evidence left alongside a design process, but also be able to model its process structure which is inherently flexible.
Uncovering Design Process Model From Design Project Documents
Figure 1 depicts the proposed methodology of learning process-relevant information and knowledge from past design documents based on an enriched process mining approach. As shown in Fig. 1, the starting point of the whole system is a set of design documents, which record the process executions of a past design project in natural language format. Based on the design documents, the embedded design process is uncovered and analyzed in two stages: uncovering the design process model and learning from the discovered process model.
The goal of the first stage is to uncover a hierarchical process model, which consists of a hierarchical tree and a set of subprocess models, using the proposed process mining approach. The obtained hierarchical tree decomposes the embedded design process into several functional modules. The subprocess models present the detailed execution traces of the modules in the hierarchical tree.
The second stage aims to distill multifaceted knowledge patterns from the discovered process model via statistical analysis, such as workflow analysis for uncovering task dependencies, performance analysis for detecting potential bottlenecks or irregular task executions, social network analysis for discovering cooperation patterns, as well as resource utilization and role analysis for estimating the degree of individual contributions.
Details about the first stage of uncovering design process model are reported in this section, whereas details on the second stage are presented in Sec. 4.
Constructing Hierarchical Tree.
A top-down clustering approach based on document content is specifically designed to surface a modular representation of the embedded design process. The heuristic is that design documents with similar content are likely to be relevant to the same design task in reality. Therefore, the proposed approach decomposes the input design documents into clusters based on their content similarity. The content similarity of any two documents relies on the overlap of their topic distributions. Furthermore, in order to get a hierarchical representation of the underlying design process, the decomposition procedure proceeds in a top-down manner until the desired homogeneity is achieved. As a result, the hierarchical representation is a tree, and each node of the tree corresponds to a functional module, within which more detailed, homogeneous executions could be observed from the corresponding document cluster.
Figure 2 describes the top-down clustering approach for constructing the hierarchical tree in detail. The meanings of some fundamental notions are defined as follows:
is a set of document clusters;
is a set of functional modules, each module corresponds to a subprocess model mined from a document cluster ; and
T is a tree that organizes the functional modules in a hierarchical structure.
The algorithm starts by representing the documents using a set of latent topics. As deep belief networks (DBN) [35,36] have been shown to perform well in both learning and fast inference of per-document topic distributions, the DBN-based topic modeling approach [16] is employed to obtain the topic distribution for each input document. A typical DBN model is a deep neural network consisting of one input layer of observation, one output layer of reconstruction, and several hidden layers. It transforms a document from a word-frequency vector into a topic-probability vector , where is the number of topics detected from the entire document archive , and is the probability that the jth topic appears in a document.
where is the per-document-topic distribution, is the average per-document-topic distribution of , measures the cosine distance between any two topic distributions, and is the cluster size.
After each decomposition, a set of new hierarchical relations is created in T (shown in line 11). The whole decomposition process can be iterated until all the leaf modules in are homogeneous enough, producing a hierarchical structure of the process in .
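The top-down decomposition can be illustrated with a minimal sketch. It assumes that each document is already represented by a topic-probability vector and that homogeneity is measured as the average cosine distance between the documents of a cluster and the cluster mean, as described above. The splitting rule (a simple two-way split by cosine distance), the threshold values, and all function names are assumptions made for illustration only; they are not the procedure of Fig. 2.

```python
import numpy as np

def cosine_distance(a, b):
    """Cosine distance between two topic distributions."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def heterogeneity(topics):
    """Average cosine distance from each document to the cluster mean."""
    mean = topics.mean(axis=0)
    return float(np.mean([cosine_distance(t, mean) for t in topics]))

def bisect(topics, n_iter=20):
    """Two-way split by cosine distance (an assumed splitter, not from the paper)."""
    mean = topics.mean(axis=0)
    idx = range(len(topics))
    # Seed the two centers with mutually distant documents.
    first = max(idx, key=lambda i: cosine_distance(topics[i], mean))
    second = max(idx, key=lambda i: cosine_distance(topics[i], topics[first]))
    centers = np.stack([topics[first], topics[second]])
    labels = np.zeros(len(topics), dtype=int)
    for _ in range(n_iter):
        labels = np.array([
            int(cosine_distance(t, centers[1]) < cosine_distance(t, centers[0]))
            for t in topics
        ])
        if labels.min() == labels.max():
            break  # degenerate split: every document fell on one side
        centers = np.stack([topics[labels == k].mean(axis=0) for k in (0, 1)])
    return labels

def build_tree(topics, doc_ids, threshold=0.1, min_size=2):
    """Recursively decompose a document cluster until it is homogeneous enough."""
    node = {"docs": list(doc_ids), "children": []}
    if len(doc_ids) < 2 * min_size or heterogeneity(topics) <= threshold:
        return node  # leaf module
    labels = bisect(topics)
    if labels.min() == labels.max():
        return node
    for k in (0, 1):
        mask = labels == k
        child_ids = [d for d, keep in zip(doc_ids, mask) if keep]
        node["children"].append(build_tree(topics[mask], child_ids, threshold, min_size))
    return node
```

Each node of the returned dictionary plays the role of a functional module, and the `docs` list of a leaf is the document cluster from which a subprocess model would subsequently be mined.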
Discovering Subprocess Model.
The second step of the first stage mines subprocess models for the modules in the generated hierarchical tree from the corresponding document clusters. Let be a subprocess model. The formal representation of is a tuple , where is a finite set of fundamental elements that point to physical objects, such as people, organizations, tools, and locations, is a finite set of design events, each of which consists of several physical objects in , and defines the potential sequence in which design events have been executed. Based on this definition, the procedure of subprocess model discovery is further decomposed into three steps: meta-information extraction for , event discovery for , and edge detection for .
Meta-Information Extraction.
The focus of meta-information extraction is extracting special writing expressions, called named entities (NEs), from the input design texts. The extracted NEs might point to physical objects that were involved in the target design process. In this work, seven types of process-relevant NEs are considered, namely, tasks/activities (TE), timestamps (SE), persons (PE), organizations (OE), locations (LE), input/output information (IE), and techniques/tools (ME). A hybrid named entity recognition (NER) approach is proposed to recognize the above NEs from texts. By treating all the noun phrases (NP) as candidate NEs, the proposed NER approach first generates a small set of seed NEs from the candidate NPs via rule matching. Next, by learning more complex features from the seed NEs, the proposed NER approach expands the set to more general NEs from the remaining candidate NEs. The motivation for integrating rule matching and machine learning is to keep human intervention to a minimum.
Seed entity generation.
The rules for matching seed entities are created based on the concept of “speech acts.” Originally, “speech acts” are defined as “illocutionary” verbal utterances that have a performative function to present a speaker's intentions, such as promising, ordering, requesting, and inviting [23]. Based on this theory, this paper considers that noun phrases associated with special verbs or nouns might point to process-relevant objects with high confidence. Examples of such words include “submit,” “complete,” “use,” and “software.” This paper refers to such verbs and nouns as speech act words. Each entity type owns a speech act dictionary and a set of matching rules formed on . The preliminary set of speech act words in is manually selected from a set of randomly selected sentences. To get a more general , the preliminary is then expanded by including more synonyms using WordNet, which is a large lexical database of English.
Given the speech act dictionary , a seed NE is defined as a noun phrase that is associated with a speech act verb or contains a speech act noun in . Here, the open library Stanford CoreNLP [37], which provides a set of natural language analysis tools, is used to find the candidate noun phrases in the texts.
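The seed matching rule can be sketched as follows. The sketch assumes that a parser has already produced candidate noun phrases together with the verb governing each phrase; the dictionary contents, the assignment of words to entity types, and the function names are hypothetical and serve only to illustrate the rule.

```python
# Toy speech act dictionaries per entity type. The assignment of words to
# entity types below is an assumption made for illustration.
SPEECH_ACT_DICT = {
    "TE": {"verbs": {"submit", "complete", "finish"}, "nouns": {"task", "review"}},
    "ME": {"verbs": {"use", "apply"}, "nouns": {"software", "tool"}},
}

def match_seed_type(noun_phrase, governing_verb, dictionary=SPEECH_ACT_DICT):
    """Return the entity type of a seed NE, or None if no rule matches."""
    tokens = noun_phrase.lower().split()
    for entity_type, words in dictionary.items():
        # Rule A: the noun phrase is associated with a speech act verb.
        if governing_verb and governing_verb.lower() in words["verbs"]:
            return entity_type
        # Rule B: the noun phrase contains a speech act noun.
        if any(token in words["nouns"] for token in tokens):
            return entity_type
    return None

def generate_seeds(candidates):
    """Split (noun phrase, governing verb) candidates into seeds and the rest."""
    seeds, remaining = [], []
    for noun_phrase, verb in candidates:
        entity_type = match_seed_type(noun_phrase, verb)
        if entity_type is None:
            remaining.append(noun_phrase)
        else:
            seeds.append((noun_phrase, entity_type))
    return seeds, remaining
```

The phrases left in `remaining` are the candidates that the subsequent entity expansion step would classify.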
Entity expansion.
Based on the seed entities detected, a kernelized support vector machine (SVM) classifier is trained to retrieve more general NEs from the remaining candidate NEs. As a kernel function avoids the time and effort of explicitly selecting features for the feature space, a string kernel based on surrounding words [16] is employed to measure the similarity between two NEs.
where indicates the ith word in the local context of , is the weights, is the number of words between the ith word and the head word of , and is number of words in the local context of .
where returns 1 or 0, indicating whether and are two identical words, and is the number of words shared in the local contexts of two NEs.
Using the above kernel function, an SVM classifier is trained on the seed entities and applied to predict the entity type of the remaining candidate NEs.
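A context-based kernel of this kind can be sketched as below. The sketch assumes that each context word is weighted by a factor that decays geometrically with the number of words between it and the head word, and that the kernel sums the products of these weights over the words shared by the two contexts. The decay value, the handling of repeated words, and the function names are illustrative assumptions rather than the exact kernel of [16].

```python
def context_weights(context, head_index, decay=0.5):
    """Weight each context word by its distance to the head word.

    The weight is decay ** d, where d is the number of words between the
    context word and the head word (assumed weighting scheme).
    """
    weights = {}
    for i, word in enumerate(context):
        if i == head_index:
            continue
        distance = abs(i - head_index) - 1
        key = word.lower()
        # Keep the largest weight if a word occurs more than once.
        weights[key] = max(decay ** distance, weights.get(key, 0.0))
    return weights

def context_kernel(ctx_a, head_a, ctx_b, head_b, decay=0.5):
    """Similarity of two NEs based on the words shared by their local contexts."""
    weights_a = context_weights(ctx_a, head_a, decay)
    weights_b = context_weights(ctx_b, head_b, decay)
    shared = weights_a.keys() & weights_b.keys()
    return sum(weights_a[w] * weights_b[w] for w in shared)
```

A Gram matrix computed with `context_kernel` over the seed entities could then be passed to an SVM implementation that accepts precomputed kernels, for example scikit-learn's `SVC(kernel="precomputed")`.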
Event Discovery.
The second step aims to detect design events from design documents by identifying the semantic relations among the recognized NEs. Let be the entity types in the first step, a design event is defined as a graph , where
V—each vertex denotes a NE, and the entity type of belongs to ;
—the graph is centered at , , and the entity type of must be TE (task entity);
are the starting and ending time of an event, and their entity type must be SE (time entity);
—each edge denotes a relation between a normal vertex and the center vertex , therefore, .
According to the above definition, a design event can be viewed as a higher-order relation that is centered at a task entity, and all the other related NEs connect to the central task entity directly or indirectly. Two problems arise here: the number of NEs in a design event varies, and event graphs might overlap on some vertices because design events could share some resources. To tackle these two problems, a graph-partition-based method of higher-order relation extraction is proposed. Figure 4 shows the workflow of the proposed approach. The basic idea is to factorize the higher-order relation in a design event graph into several binary relations and to reconstruct the design event by finding the maximal cliques around each task entity based on the binary relations.
Direct binary relation detection.
Two entities mentioned in the same sentence are more likely to be related. Based on this assumption, the step of binary relation detection finds pairs of NEs that are mentioned in the same sentence and have a semantic relation. A rule-based pattern matching approach is used to match binary relations sentence by sentence. The rules include
Rule 1: two entities must be mentioned in the same clause.
Rule 2: two entities are directly connected in the sentence dependency tree.
Rule 3: the type of two entities must be consistent with one of the binary relation types predefined via expert knowledge.
Rule 4: the sentence or clause is in present tense.
Rule 5: no negative words (e.g., don't, not) exist between two entities.
It is worth mentioning that rule 3 is designed to eliminate impractical binary relations that provide little or no significant information for event detection, for example, relations between two location entities or between two time entities. In addition, rule 4 is introduced to find design events that are being done or will be done, and rule 5 eliminates negative relations.
This step would produce a binary relation graph . The vertices in are the NEs mentioned in a document, the edges in are the binary relations between the mentioned NEs, and the weight of each edge, , is the frequency of the corresponding binary relation.
Indirect higher-order relation detection.
The step of indirect higher-order relation detection aims to partition the binary relation graph by finding the maximal clique around each entity in . Each maximal clique is a candidate event, and maximal cliques around different task entities can overlap on some vertices.
Next, the maximal clique around each task entity is greedily detected by expanding an initial clique in all the directions that the clique density increases. The initial clique is a subgraph in which all the entities have a direct relation to the central task entity. New nodes are added to the initial clique if they are connected to the initial clique and the density of the new clique is larger than the density of the old one. The process of clique expansion stops when no more nodes could be added. By this means, nodes with weak connections to their neighbors would be eliminated, and the number of the entities in an event is determined by the local density of the binary relation graph.
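The greedy expansion can be sketched as follows. The text does not state how clique density is computed here, so the sketch assumes that density is the total edge weight inside a node set divided by the number of node pairs; this definition, the tie-breaking order, and the function names are assumptions for illustration.

```python
from itertools import combinations

def density(nodes, weights):
    """Total edge weight inside the node set divided by the number of node pairs.

    This density definition is an assumption made for the sketch.
    """
    nodes = list(nodes)
    if len(nodes) < 2:
        return 0.0
    total = sum(weights.get(frozenset(pair), 0) for pair in combinations(nodes, 2))
    return total / (len(nodes) * (len(nodes) - 1) / 2)

def neighbors(node, weights):
    """Entities that share a binary relation with the given entity."""
    return {n for pair in weights if node in pair for n in pair if n != node}

def expand_event(task, weights):
    """Grow a candidate event around a task entity while density increases."""
    # Initial clique: the task entity and every entity directly related to it.
    clique = {task} | neighbors(task, weights)
    while True:
        frontier = set().union(*(neighbors(n, weights) for n in clique)) - clique
        best, best_density = None, density(clique, weights)
        for candidate in sorted(frontier):
            d = density(clique | {candidate}, weights)
            if d > best_density:
                best, best_density = candidate, d
        if best is None:
            return clique  # no remaining node increases the density
        clique.add(best)
```

Here `weights` maps a `frozenset` of two entity names to the frequency of the corresponding binary relation, mirroring the weighted binary relation graph produced by the previous step.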
Event selection.
The candidate events obtained in the last step might contain very small cliques that are likely to be noise. To filter out the noisy cliques, the step of event selection first ranks the candidate events detected from a single document by weighting them according to the sum of their edge weights. A cutoff is then used to select the candidate events whose weights fall into the top rank. In this way, only maximal cliques that have a larger number of nodes or stronger edges remain as design events.
For simplifying the discovered design event graph, all the paths from the central task node to other indirectly connected nodes are replaced by single edges. The weight of the new edges is the smallest edge weight in the corresponding paths.
Finally, the starting and ending time nodes of each design event are simply set as the minimal and maximal time indicated by the time entity nodes. If no time entity node is included in a clique, the creation time of the corresponding document is used instead.
Edge Detection.
where is the size of the time window and is the time interval, which is determined by the starting time of and the ending time of . Based on , the relation between and is classified as follows:
: there is no relation if ;
: is parallel with if equals 1, which means two events are executed at the same time;
: is executed following if .
A corresponding edge is added to the subprocess model only when the precedence relation is detected. It is noteworthy that by using the time criteria, not only the direct relations and but also the long-distance relation are taken into account as long as and are executed close enough. In this way, two events that are highly related but separated by a third event can be reconnected.
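A time-window criterion of this kind can be illustrated with a hedged sketch. It assumes a linear closeness score that equals 1 when two events overlap or touch in time, decreases as the gap between the end of the earlier event and the start of the later event grows, and reaches 0 once the gap is at least the window size. This scoring function, the event representation, and the function names are assumptions made for illustration and may differ from the measure used in this work.

```python
from datetime import datetime, timedelta

def closeness(event_i, event_j, window):
    """Assumed linear score in [0, 1] based on the gap between two events."""
    gap = (event_j["start"] - event_i["end"]).total_seconds()
    if gap <= 0:
        return 1.0  # the events overlap or touch in time
    return max(0.0, 1.0 - gap / window.total_seconds())

def classify(event_i, event_j, window):
    """Classify the relation between an earlier and a later event."""
    score = closeness(event_i, event_j, window)
    if score == 0.0:
        return "none"
    if score == 1.0:
        return "parallel"
    return "precedes"

def detect_edges(events, window):
    """Add an edge for every pair of events with a precedence relation."""
    ordered = sorted(events, key=lambda e: e["start"])
    edges = []
    for i, earlier in enumerate(ordered):
        # Every later event is checked, not only the immediate successor,
        # so that long-distance relations within the window are kept.
        for later in ordered[i + 1:]:
            if classify(earlier, later, window) == "precedes":
                edges.append((earlier["name"], later["name"]))
    return edges
```

Because every later event inside the window is compared, an event can be linked to a successor even when a third event lies between them.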
Learning From the Discovered Process Model
The second stage of the proposed methodology proceeds to distill multifaceted design information of empirical knowledge patterns from the hierarchical process model using statistical analysis methods. In detail, the discovered process model is viewed in three correlated dimensions, consisting of design tasks/events, personnel, and time. Different analysis methods are applied to individual dimensions or the combination of any two dimensions to mine particular process information that could be helpful in solving special problems. Figure 5 illustrates this analysis process in a three-dimensional model. Referring to Fig. 5, the personnel dimension is the people involved in the discovered process, and the task dimension refers to subprocess models in the hierarchical tree or the small design events that constitute the subprocess models. As shown in Fig. 6, the discovered process model is analyzed from five perspectives: workflow analysis, performance analysis, role analysis, social network analysis, and resource utilization analysis.
Workflow Analysis.
The workflow aspect of the uncovered process model addresses the question “what does the actual process look like?” by analyzing the relationships between the subprocess models or the design events. This is achieved in two ways. First, the hierarchical tree captures the subordination relationships between the functional modules, which reflect how the entire design process is iteratively decomposed into several smaller design tasks. Second, the subprocess model within each module reflects the detailed execution traces through which a specific design problem is solved. Such a hierarchical representation helps decision-makers to quickly locate and understand the parts they are interested in.
Performance Analysis.
The performance aspect analyzes the subprocess models according to their execution time. This helps to answer questions such as “are there any irregular task executions or bottlenecks in the actual process?” In a product design process, irregular executions or bottlenecks are usually design tasks that slow down the whole design process. Identifying them allows decision-makers to locate the area where a problem occurs and to identify its root causes, so as to avoid the same mistakes in a new design project. To identify irregular executions, subprocess models are compared in a dotted chart, which shows the subprocess models or their subordinate events on the vertical axis and the corresponding execution time on the horizontal axis. Based on the dotted chart, subprocess models that have an extremely long duration or were suspended frequently are candidates for irregular executions and bottlenecks. The execution traces related to these subprocess models should be carefully inspected to detect the actual root causes, such as a lack of resources, waiting for the outputs of other design tasks, or an operator who needs training.
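The screening that a reader performs visually on the dotted chart can be approximated numerically. The following sketch assumes that each subprocess model is summarized by the days on which its events occurred, that a suspension is a gap between consecutive events longer than a chosen number of days, and that a duration is "extremely long" when it exceeds a multiple of the median duration. All thresholds and names are illustrative assumptions.

```python
from statistics import median

def summarize(event_days, max_gap_days=14):
    """Return (duration, number of suspensions) for one subprocess model."""
    days = sorted(event_days)
    duration = days[-1] - days[0]
    # A suspension is a gap between consecutive events above max_gap_days.
    suspensions = sum(1 for a, b in zip(days, days[1:]) if b - a > max_gap_days)
    return duration, suspensions

def find_irregular(subprocesses, duration_factor=2.0, max_suspensions=2, max_gap_days=14):
    """Flag subprocess models with a very long duration or frequent suspensions."""
    stats = {name: summarize(days, max_gap_days) for name, days in subprocesses.items()}
    typical = median(duration for duration, _ in stats.values())
    return sorted(
        name
        for name, (duration, suspensions) in stats.items()
        if duration > duration_factor * typical or suspensions >= max_suspensions
    )
```

The flagged names only indicate where to look; the corresponding execution traces still need manual inspection to establish the root cause.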
Role Analysis.
The role analysis aims to determine the relative value of the people involved by measuring and comparing their contributions to the whole design process. By focusing on the interaction between personnel and time, generalists who were active throughout the entire design process could be recognized as core participants, whereas people who only participated in some specific design events could be recognized as specialists in the field relevant to those design events. The dotted chart is again used to perform role analysis. The vertical axis represents the people included in the process model, the horizontal axis indicates the time, and each dot denotes that a person was involved in a design event at that time. The density of the dots reflects the contribution of each person to the whole design process. Based on the dotted chart, the key personnel from a similar past project can be quickly identified and considered for a new design project.
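The distinction between core participants and specialists can be sketched numerically. The sketch assumes that the project time line is divided into equal periods (for example, months) and that a person counts as a core participant when active in at least a chosen share of those periods; the share threshold and all names are assumptions for illustration.

```python
from collections import defaultdict

def classify_roles(participation, n_periods, core_share=0.6):
    """Label each person as 'core' or 'specialist' by share of active periods.

    participation: iterable of (person, period_index) pairs, one per dot
    in the dotted chart.
    """
    active = defaultdict(set)
    for person, period in participation:
        active[person].add(period)
    roles = {}
    for person, periods in active.items():
        share = len(periods) / n_periods
        label = "core" if share >= core_share else "specialist"
        roles[person] = (label, round(share, 2))
    return roles
```

The share of active periods is only a rough proxy for the dot density observed in the chart, and the threshold would need to be tuned per project.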
Social Network Analysis.
where denotes all the involved people, indicates the people who are directly connected to in the social network graph, are the people in the same group with , is the interaction strength of two people, and function calculates the number of unique group labels in a set of participants.
Human Resource Utilization.
People involved in past design projects, together with their levels of knowledge and skills, are valuable resources that can benefit future design projects. To extract information about human resources, such as “who executed what tasks?” and “to what degree was a person involved in a task?”, the human resource utilization aspect analyzes the relationship between the personnel and the subprocess models. A histogram is created for each subprocess model to compare the engagement of the people involved. This can give decision-makers helpful guidance when allocating the most suitable people to a similar new design task, so as to improve the human resource utilization in the new project.
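The data behind such a histogram can be sketched as follows. The record layout of (subprocess, event identifier, person) and the function name are assumptions for illustration. Engagement is expressed here as the percentage of a subprocess's distinct events in which a person took part, which is one possible normalization; because several people can share an event, the percentages within a subprocess may sum to more than 100.

```python
from collections import defaultdict


def engagement_shares(records):
    """Return, per subprocess, a ranked list of (person, percent of events).

    `records` is an iterable of (subprocess, event_id, person) triples.
    """
    events = defaultdict(set)                             # subprocess -> event ids
    involvement = defaultdict(lambda: defaultdict(set))   # subprocess -> person -> event ids
    for subprocess, event_id, person in records:
        events[subprocess].add(event_id)
        involvement[subprocess][person].add(event_id)

    shares = {}
    for subprocess, people in involvement.items():
        total = len(events[subprocess])
        # Most engaged person first; ties are broken by name for stable output.
        ranked = sorted(people.items(), key=lambda item: (-len(item[1]), item[0]))
        shares[subprocess] = [(person, 100.0 * len(ids) / total) for person, ids in ranked]
    return shares


# Toy records (hypothetical, not taken from the case study).
records = [
    ("S1", "e1", "P146"), ("S1", "e2", "P146"), ("S1", "e2", "P0"),
    ("S1", "e3", "P7"), ("S1", "e4", "P146"),
]
for person, percent in engagement_shares(records)["S1"]:
    print(f"{person}: {percent:.0f}%")
```

Each ranked list corresponds to the bars of one histogram, so it can be passed directly to any plotting tool.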
Case Study
All the algorithms integrated in the proposed process mining approach were implemented in Python. The entire process mining approach was demonstrated and verified on a real case study of a university-hosted design project named the “traffic wave project.” The project focused on designing a traffic control system to ease traffic congestion on expressways, and its results were published in a conference paper [38]. The project had eight core participants, including students and professors from two different disciplines. Throughout the design process, they frequently contacted many people from engineering design, vehicle design, control, model building, and external companies. In addition, as this project was only one of several correlated subprojects of a bigger project, core participants from other subprojects were also involved. The whole design process lasted about 2 years, from March 2011 to February 2013.
Dataset and Evaluation.
The example data were a set of emails collected from the traffic wave project. Throughout the design process, all of the participants sent a copy to a specific common account whenever they used email to exchange and discuss their opinions. This resulted in a total of 569 emails, which were collected and saved in an MS Outlook file. In each email, the design tasks are discussed in the email body, the people involved are mentioned either as the email sender/receiver or in the email body, and the time is indicated by the creation time of the email.
The original dataset was cleaned by removing the duplicates in the reply thread, resulting in 357 remaining emails. The personal pronouns in the email body were replaced by the names of the people in the TO, FROM, and CC fields of the corresponding emails. To give a deeper impression of the cleaned email dataset, Fig. 6 plots the histogram of email length.
To validate how well the discovered process model conforms to reality, two senior participants and one junior participant from the traffic wave project were asked beforehand to sketch the originating process model embedded in the email dataset. Next, the quality of the discovered process model was assessed by mapping it back to the originating process model with the experts' help. To assess the correctness of the discovered knowledge patterns, every knowledge pattern was checked with the three interviewed participants. A knowledge pattern was considered correct if it was in accordance with the experts' experience.
Experimental Results
Overview of the Hierarchical Process Model.
Three hundred sentences were randomly selected to manually seed the speech act words, and 13,734 NEs, including 191 unique personal names, were recognized by the hybrid NER approach. With the cutoff parameter set to 0.8, 661 design events were detected by the event detection approach, and 41 subprocess models were constructed in the hierarchical tree.
To give an intuitive feeling of the discovered design event, Fig. 7 illustrates an example of event detection. Figure 7(a) gives an example document which consists of multiple sentences. Figure 7(b) shows the binary relation graph constructed from it. In Fig. 7(b), seven entities were recognized as task entities, i.e., “make group,” “report progress,” “redefine problem,” “be issue,” “adopt transportation system,” “target aspect,” and “shape project.” Figure 7(c) illustrates two event graphs detected from the binary relation graph in Fig. 7(b). From the entities contained in the event graphs, it can also be observed that entities within a design event are usually mentioned in different sentences, rather than all in a single sentence.
Workflow Analysis.
Figure 8 compares the automated subprocesses with the originating tasks that were given by the interviewed participants. Each subprocess listed in Fig. 8 is named according to its most frequent event. Subprocesses for which the interviewed participants could find a relevant originating task are highlighted and connected to their originating tasks in the middle column. Figure 8 shows that 39 of the 41 subprocesses have a counterpart in the originating tasks. Conversely, eight of the nine originating tasks are connected to at least one subprocess. These findings indicate that the automated subprocesses reflect the originating tasks, but in a more detailed view.
Figure 9 illustrates a segment of the discovered hierarchical process model. Because of space limitations, Fig. 9 shows only three subprocesses. The rectangular nodes represent small design events, and the arrows indicate their workflows within a subprocess. From the names of the design events, it can be observed that the three subprocesses all concern the execution of a presentation, but with regard to different aspects. The first subprocess in Fig. 9 shows a very clear workflow of scheduling and rescheduling the presentation date, the second subprocess is about making the presentation, and the third subprocess reflects the procedure of making the assessment after the presentation. According to the feedback from the interviewed participants, the three subprocess models correctly reflect the procedure by which the undergraduate participants of this project gave the presentation for their final year project using the achievements of the traffic wave project.
Performance Analysis.
Figure 10(a) visualizes the dotted chart of the subprocess models. The horizontal axis indicates the time, and the vertical axis presents the 41 subprocesses listed in Fig. 8. The dots are the events belonging to each subprocess. The dotted chart in Fig. 10(a) shows that most subprocesses proceeded concurrently, e.g., tasks 1–3 and tasks 4–7. From Fig. 10(a), it can also be observed that the design activities in most subprocesses were temporarily suspended to some extent. Subprocesses that were suspended frequently or for a long time might have been executed irregularly. Based on this assumption, Fig. 10(a) shows that six subprocesses might have irregular executions, i.e., task 3, task 6, task 9, task 12, task 16, and task 18.
Figure 10(b) plots the temporal event throughput, which is the number of events executed in a short period. Continuous periods whose temporal event throughput exceeds a threshold set at the average throughput are identified as busy periods; the remaining periods are identified as inactive periods. According to Fig. 10(b), the traffic wave project had four relatively busy periods and three relatively inactive periods. The design tasks started during the inactive periods might be causes that impeded the project progress. In Fig. 10, the subprocesses started during the first inactive period are tasks 4–9, the subprocesses started during the second inactive period are tasks 15–17, and no subprocesses were started during the third inactive period. The interviewed participants explained that during the first inactive period, the whole design project was temporarily suspended to apply for an approval from a related organization; they were not familiar with the application procedure, which slowed down the design progress. For the second inactive period, they explained that the core participants spent a long time contacting the manufacturers of different simulation tools before they found a suitable one, and task 15 is related to this procedure.
Taking the above analyses together, the detected irregular subprocesses or bottlenecks indicate the areas where problems that might slow down project progress occur. Identifying irregular executions or bottlenecks and their potential root causes allows decision-makers to be aware of such issues and to avoid them in a new design project of a similar nature.
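The busy and inactive periods described for Fig. 10(b) can be approximated with a short sketch. It assumes fixed-length windows (30 days here) and uses the mean window count as the threshold; the window length, the function names, and the toy counts are assumptions for illustration and are not values from the case study.

```python
from datetime import date


def throughput(event_days, start, end, window_days=30):
    """Count events in consecutive fixed-length windows covering [start, end]."""
    n_windows = (end - start).days // window_days + 1
    counts = [0] * n_windows
    for day in event_days:
        if start <= day <= end:
            counts[(day - start).days // window_days] += 1
    return counts


def label_periods(counts):
    """Merge consecutive windows into (label, first window, last window) runs.

    A window is "busy" when its count exceeds the mean count over all
    windows, and "inactive" otherwise.
    """
    threshold = sum(counts) / len(counts)
    periods = []
    for index, count in enumerate(counts):
        label = "busy" if count > threshold else "inactive"
        if periods and periods[-1][0] == label:
            periods[-1] = (label, periods[-1][1], index)  # extend the current run
        else:
            periods.append((label, index, index))         # start a new run
    return periods


# Toy window counts (hypothetical).
print(label_periods([5, 6, 1, 0, 7, 1]))
```

Subprocesses whose first event falls inside an inactive run can then be listed for closer inspection, in the same way as tasks 4–9 and tasks 15–17 above.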
Role Analysis.
Figure 11(a) plots the dotted chart for comparing the relative engagement of the people involved in the traffic wave project. The vertical axis represents the 191 people detected in the discovered process model. Each dot indicates that a person was involved in an event at a certain time. The local dot density in Fig. 11(a) reflects the temporal engagement of the corresponding people in a continuous period. For example, Fig. 11(a) shows that P60-80 joined the project very late, and P160-180 were active at the beginning but withdrew midway. Such participants, who were involved only for a short period, might be specialists in handling some special design tasks. The global dot density of each line indicates the overall engagement of a person in the entire design process. To visualize the global dot density, Fig. 11(b) plots the number of dots in each line. Figure 11(b) shows that P0-1 and P7-12 might be core participants of the traffic wave project, as they had the largest global dot densities and were continuously active throughout the entire design process.
Social Network Analysis.
Figure 12 shows four social networks formed under different conditions. Figure 12(a) visualizes the interaction behaviors of all 191 participants. As Fig. 11 shows that a significant portion of the participants took part in only a small number of events, Figs. 12(b) and 12(c) filter out the less actively engaged participants to help decision-makers focus on the behaviors of the actively involved members. In the four social networks, nodes of the same color denote people who are clustered in the same clique, except for the red nodes, which denote people who did not join any clique. The solid edges depict the interaction within a clique, while the dashed edges reflect the interaction across different cliques. The interaction strength of any two connected nodes is measured by the number of their joint events.
The graphs in Fig. 12 reveal the existence of three big cliques, within which people interacted frequently. Among the three cliques, C1, including P0, P7-10, P12, and P16, corresponds to the core participants of this project. This is consistent with the observation obtained from the role analysis in Fig. 11. The nodes filtered out by the graphs in Figs. 12(b)–12(d) indicate that people in C2 and C3 engaged less actively than the people in C1. This finding is consistent with the feedback that people in C2 and C3 were not the main participants of this project, but participants from other sister subprojects.
Inspecting the degree of incoming edges across different cliques reveals that four participants frequently interacted with people from different cliques, which suggests that they occupy a kind of administrative position. They are P1, P18, P19, and P34, denoted by the four biggest nodes in Fig. 12. In addition, Fig. 12 highlights the originators who have the highest degree of incoming edges within a clique. They are P0, P21, and P23, who occupy a kind of leadership position. These observations were confirmed by the interviewed participants: the people denoted as admins are four professors who supervised different subprojects, and P0 is the student leader of the traffic wave project.
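A simplified sketch of the two quantities used in this analysis is given below: the interaction strength, taken as the number of joint events, and a count of connections that cross clique boundaries. It treats the network as undirected, whereas the analysis above relies on incoming edges, and it assumes the clique labels are already known. The function names and the convention that a person without a clique label differs from every labeled person are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations


def interaction_strength(events):
    """Count joint events for every unordered pair of participants.

    `events` is an iterable of participant collections, one per design event.
    """
    strength = Counter()
    for participants in events:
        for a, b in combinations(sorted(set(participants)), 2):
            strength[(a, b)] += 1
    return strength


def cross_clique_degree(strength, clique_of):
    """Count, per person, the neighbours that carry a different clique label.

    People missing from `clique_of` have no clique (label None); two such
    people are not counted as a cross-clique pair.
    """
    degree = Counter()
    for a, b in strength:
        if clique_of.get(a) != clique_of.get(b):
            degree[a] += 1
            degree[b] += 1
    return degree


# Toy events and clique labels (hypothetical).
events = [["P0", "P7", "P1"], ["P0", "P7"], ["P1", "P21"]]
clique_of = {"P0": "C1", "P7": "C1", "P21": "C2"}
strength = interaction_strength(events)
print(cross_clique_degree(strength, clique_of).most_common(1))
```

In this toy example the person without a clique label has the most cross-clique neighbours, which mirrors the observation that admins can bridge cliques without belonging to one.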
Human Resource Utilization.
Figure 13 illustrates two histograms that compare the percentage engagement of the people involved in two example subprocesses. The engagement degree of the participants is measured by the number of events they executed in a subprocess.
Referring to Fig. 13, 14 people are involved in the first subprocess and 19 people in the second subprocess. Figure 13 also shows that all the core members included in clique C1 of Fig. 12, i.e., P0-1, P7-10, P12, and P16, are involved to some degree in both subprocesses. Another interesting observation from Fig. 13(a) is that a noncore participant, P146, executed almost half of the events of the first subprocess, while the most active core member executed only 7% of the events. In contrast, the histogram in Fig. 13(b) reveals that the core members played a major role in the second subprocess, as two of the three most active participants, i.e., P0 and P1, are core members.
Discussion
The validations with the experts' help have proven the efficacy of the proposed process mining approach. With respect to the workflow analysis, the discussion with the experts revealed that the discovered process model indeed represented the innate character of their processes. Moreover, the hierarchical structure allows people to focus only on the most relevant part of the process. This is especially helpful when the complete process is flexible and not so straightforward for junior personnel who are new to the project.
For the performance analysis, it was confirmed that the major task reflected by subprocesses T4, T5, T7, T10, and T11 indeed slowed down the whole project for several months. It was also stated that the inputs of the next major task, reflected by T12 and T14 in Fig. 10, had no dependence on the outputs of T4, T5, T7, T10, and T11. In this case, the entire project time could have been reduced by at least 3 months if T12 and T14 had been started simultaneously with T4.
In the discussion about the role analysis results, the interviewed participants were somewhat surprised that 191 people were involved throughout the project, as there were only eight core members at the beginning. When given the social network graphs, the three participants could name the different cliques and recognize the clique consisting of the core members. It is also worth pointing out that three of the four participants recognized as admins, i.e., P1, P18, and P34, did not join any clique in Fig. 12. What this means for social network analysis is that one cannot focus only on the people who interact with each other most frequently. Some participants, such as admins, might have fewer interactions with regular participants but play a significant role in making all the participants work together. Knowing the different kinds of cooperation patterns would help decision-makers to allocate the most suitable human resources in future design projects.
Additionally, the proposed process mining approach can help decision-makers gain a deeper understanding of a past design project from a more objective point of view. Because the process models are directly discovered from real-life design documents, they respect the reality and reduce the human bias introduced by conventional approaches such as surveys, interviews, and discussions. Furthermore, the discovered process model can be further analyzed to detect good practices or bad experiences such as irregular executions, delays, and bottlenecks in actual executions.
The discussion with the core participants also revealed several possibilities for future improvement and extension. The current work used project emails as the data source for case study validation. There may be situations in which some design tasks were discussed less frequently via email but were recorded in other types of design documents, such as conference minutes, progress reports, and conversation transcripts. For example, Fig. 8 shows that the process model discovered from the email dataset provides little information about the originating task of hardware validation. The proposed approach itself is applicable to other types of design documents. Therefore, if more design documents are provided, the proposed approach could lead to a more comprehensive process model and to other major performance patterns of interest.
The subprocesses shown in Fig. 9 also reflect some iterative design events. For example, the student participants first scheduled the presentation date and then rescheduled it several times. Such loops and iterations play a key role in measuring the efficiency of a design process and could help decision-makers to find ways to facilitate design processes. Therefore, creating automatic approaches for identifying loops and iterations in the design process presents an opportunity to improve the ability of the discovered process model to support decision making.
At a more macrolevel, the proposed process mining approach also introduces the possibility of managing and retrieving past design documents in a structured, graphic manner. For example, by representing a design project as the process model uncovered from its archival documents, past design projects can be compared according to the structure similarity of their process models. This is a critical step to search for and reutilize interesting information from large quantities of past design projects.
Conclusions
This paper presented a methodology for learning from the historical design documents based on process mining. The proposed methodology was developed in two main stages: the discovery of a design process model from archival design project documents and learning multifaceted knowledge patterns from the discovered process models. Novelties of the proposed methodology include (1) proposing a new process mining approach with the capability of handling textual data; (2) capturing the flexibility of a design process via a hierarchical and modular representation; and (3) applying statistical analysis methods to learn valuable knowledge patterns from the uncovered process model. The proposed methodology has been tested using an email dataset collected from a university-level design project. The results provided evidence that the proposed approach can not only correctly reveal the actual executions of past design processes, but also return meaningful knowledge patterns to support future design project and process management.