Commonsense knowledge is essential for many AI applications, including those in natural language processing, visual processing, and planning. Consequently, many sources that include commonsense knowledge have been designed and constructed over the past decades.
Recently, the focus has been on large text-based sources, which facilitate easier integration with neural (language) models and application to textual tasks, typically at the expense of the semantics of the sources. This practice prevents the harmonization of these sources, obscures their coverage and gaps, and may hinder the semantic alignment of their knowledge with downstream tasks. Efforts to consolidate commonsense knowledge have yielded partial success, but provide no clear path towards a comprehensive consolidation of existing commonsense knowledge.
The ambition of this paper is to organize these sources around a common set of dimensions of commonsense knowledge.
For this purpose, we survey a wide range of popular commonsense sources with a special focus on their relations. We consolidate these relations into 13 knowledge dimensions, each abstracting over more specific relations found in the sources. This consolidation allows us to unify the separate sources and to compute indications of their coverage, overlap, and gaps with respect to the knowledge dimensions. Moreover, we analyze the impact of each dimension on downstream reasoning tasks that require commonsense knowledge, observing that the temporal and desire/goal dimensions are very beneficial for reasoning on current downstream tasks, while distinctness and lexical knowledge have little impact. These results reveal a focus on some dimensions in current evaluation, and a potential neglect of others.
Dimensions of Commonsense Knowledge
Deborah L. McGuinness
Keywords: commonsense knowledge; semantics; knowledge graphs; reasoning
Commonsense knowledge is information that humans typically have that helps them make sense of everyday situations.
As such, this knowledge can generally be assumed to be possessed by most people and, according to the Gricean maxims, is typically omitted in (written or oral) communication.
The fact that commonsense knowledge is often implicit presents a challenge for automated natural language processing (NLP) and question answering (QA) approaches, as extraction and learning algorithms cannot count on the commonsense knowledge being available directly in text.
Due to its prominence and implicit nature, capturing commonsense knowledge holds promise to benefit various AI applications, including those in NLP, computer vision, and planning. For instance, commonsense knowledge can be used to fill gaps in and explain the predictions of a (neural) model, understand agent goals and causality in stories, or enhance robot navigation and manipulation.
Consequently, acquiring and representing commonsense knowledge in machine-readable form, as well as reasoning with it, has been a major pursuit of AI since its early days.
This has resulted in the design, construction, and curation of a rich palette of resources that include commonsense information (potentially along with other content), like Cyc, ATOMIC, WebChild, ConceptNet, WordNet, FrameNet, and Visual Genome. Some of these, such as ConceptNet and Cyc, have been deliberately created to capture information that would be useful for commonsense reasoning tasks, while others, like WordNet or Visual Genome, were intended to support other tasks, such as word sense disambiguation or image object recognition. As reported in prior work, the commonsense sources exhibit large diversity in terms of their representation formats, creation methods, and coverage. While this reflects an opportunity for this knowledge to be exploited jointly, the inherent diversity makes the consolidation of these sources challenging.
Meanwhile, the last few years have featured a reinforced focus on benchmarks that evaluate different aspects of common sense, including social, physical, visual, and numeric common sense. Further distinction has been made between discriminative tasks [53, 6, 59], where the goal is to pick the single correct answer from a list, and generative tasks, where one has to generate one or multiple correct answers [34, 7]. These tasks can be tackled by using the (entire or a subset of the) training data [37, 33], or in a zero-/few-shot evaluation regime [38, 56].
The wealth and diversity of commonsense sources, on the one hand, and benchmarks, on the other, raises a natural question: what is the role of these knowledge repositories for real-world reasoning techniques that need to incorporate commonsense knowledge? While such sources of commonsense knowledge can intuitively have tremendous value for downstream reasoning tasks, practice shows that their impact on these tasks has been relatively limited, especially in comparison to the contribution of language models.
The impact of knowledge resources so far has been generally conditioned on the special cases where the knowledge and the task are known (in advance) to be well-aligned [38, 37].
While a variety of sources [52, 41, 58] or their combination have been used to enhance language models for downstream reasoning, little is known about how this alignment between knowledge types and tasks can be achieved dynamically.
Most recent sources have focused on the breadth of knowledge, sometimes at the expense of its semantics [4, 43]. Text-based representations are particularly attractive, as they facilitate a more direct integration with language models, as well as reasoning on NLP and QA tasks. These sources are often treated as ‘corpora’, where each fact is typically lexicalized (manually or automatically) into a single sentence, which is used to inform or fine-tune a language model.
Due to the lack of focus on formal representational principles, these sources capture knowledge types that are not trivial to align with other sources, as shown by the sparse mappings available between them.
Due to the lack of a common vocabulary and alignment across these sources, their limited coverage, and the lack of focus on explicit semantics,
knowledge is typically kept in an impoverished textual form that is easy to capture and combine with language models. The downsides of this practice are: 1) commonsense knowledge across sources remains difficult to harmonize; 2) without a thorough harmonization or consolidation, it is not clear how to effectively measure coverage, overlap, or gaps; and 3) text-based representations may be unable to capture the richness of contextual reasoning typically done by humans.
Efforts to consolidate commonsense knowledge across sources [27, 17, 44] have managed to bring these sources closer, which has shown impact on commonsense QA tasks. In earlier work, we provided heuristics for defining the boundaries of commonsense knowledge, in order to extract such a subset from one of the largest available graphs today, Wikidata. Yet, these efforts have had limited success, and many consolidation questions are left open. How should one think about commonsense knowledge in a theoretical way? What does it mean to build a consolidated knowledge graph (KG) of resources created largely in a bottom-up fashion? How should the relations be chosen? What is the right level of abstraction for relations and nodes?
The ambition of this paper is to provide insight into such questions, aiming primarily to organize the types of knowledge found in current sources of commonsense knowledge.
For this purpose, we survey a wide variety of sources of commonsense knowledge, ranging from commonsense KGs through lexical and visual sources, to the recent idea of using language models or corpora as commonsense knowledge bases. We survey their relations and group them into a set of dimensions, each being a cluster of its specific relations, as found in the sources. We then apply these dimensions to transform and unify existing sources, providing an enriched version of the Commonsense Knowledge Graph.
The dimensions allow us to perform four novel experiments:
We assess the coverage of the sources with respect to each dimension, noting that some sources have wide (but potentially shallow) coverage of dimensions, whereas others have deep but narrow coverage. This supports the need to integrate these complementary sources into a single one.
We benefit from the consolidation of the dimensions to compare the facts in the sources and compute metrics of overlap. The results show that there is little knowledge overlap across sources, even after consolidating the relations according to our dimensions, thus motivating future work on node resolution.
We contrast the clusters according to our dimensions to language model-based clusters, to understand the similarities and differences in terms of their focus.
We measure the impact of each dimension on two representative commonsense QA benchmarks. Following prior work, we pre-train a language model and apply it on these benchmarks in a zero-shot fashion (without making use of the task training data). The dimensions provide a more direct alignment between commonsense knowledge and the tasks, revealing that some dimensions of knowledge are very helpful for a task, while others might even degrade model performance.
The contributions of the paper are as follows. 1) We survey existing sources of commonsense knowledge of a wide variety, with an emphasis on their relations.
We provide a categorization of those resources and include a short overview of their focus and creation methods (Section 3).
2) We analyze the entire set of relations and abstract them to a set of 13 commonsense dimensions. Each dimension abstracts over more specific relations, as found in the sources (Section 4). 3) The identified dimensions are applied to consolidate the knowledge in the Commonsense Knowledge Graph (CSKG), which integrates seven of the sources we analyze in this paper. The resulting resource is made publicly available (Section 5).
4) We make use of this dimension-based consolidation of CSKG to analyze the overlap, coverage, and knowledge gaps of individual knowledge sources in CSKG, motivating their consolidation into a single resource (Sections 5.1 – 5.3). 5) We evaluate the impact of different dimensions on two popular downstream commonsense reasoning tasks. The results show that certain dimensions, like temporal knowledge and knowledge on desires/goals, are very beneficial and well-covered by benchmarks, whereas other dimensions, like distinctness and lexical knowledge, currently have little impact. These results reveal a more precise alignment between dimensions in the resources and existing tasks, and point to gaps in both existing knowledge sources and tasks (Section 5.4).
6) We reflect on the results of our analysis, and use it as basis to provide a roadmap towards building a more semantic resource that may further advance the representation of, and reasoning with, commonsense knowledge. Such a resource would be instrumental in building a general commonsense service in the future (Section 7).
3 Sources of Commonsense Knowledge
We define a digital commonsense knowledge source as a potentially multi-modal repository from which commonsense knowledge can be extracted.
Table 1 contains statistics and examples for each source.
| Category | Source | Relations | Example 1 | Example 2 |
|---|---|---|---|---|
| Commonsense KGs | ConceptNet* | 34 | food – capable of – go rotten | eating – is used for – nourishment |
| | ATOMIC | 9 | Person X bakes bread – xEffect – eat food | PersonX is eating dinner – xEffect – satisfies hunger |
| | GLUCOSE | 10 | … makes … (that is food) – Causes/Enables – … | n/a |
| | WebChild | 4 (groups) | restaurant food – quality#n#1 – expensive | eating – type of – consumption |
| | Quasimodo | 78,636 | pressure cooker – cook faster – food | herbivore – eat – plants |
| | SenticNet | 4 | cold_food – polarity – negative | eating breakfast – polarity – positive |
| | HasPartKB | 1 | dairy food – has part – vitamin | n/a |
| Common KGs | Wikidata | 6.7k | food – has quality – mouthfeel | eating – subclass of – ingestion |
| | YAGO4 | 116 | banana chip – rdf:type – food | eating – rdfs:label – feeding |
| | SUMO* | 1,614 | food – hyponym – food_product | process – subsumes – eating |
| Lexical resources | WordNet | 10 | food – hyponym – comfort food | eating – part-meronym – chewing |
| | Roget | 2 | dish – synonym – food | eating – synonym – feeding |
| | FrameNet | 8 (f2f) | Cooking_creation – has frame element – Produced_food | eating – evoke – Ingestion |
| | MetaNet | 14 (f2f) | Food – has role – food_consumer | consuming_resources – is – eating |
| | VerbNet | 36 (roles) | feed.v.01 – Arg1-PPT – food | eating – hasPatient – comestible |
| Visual sources | Visual Genome | 42,374 | food – on – plate | boy – is eating – treat |
| | Flickr30k | 1 | a food buffet – corefers with – a food counter | a eating place – corefers with – their kitchen |
| Corpora & LMs | GenericsKB | n/a | Aardvarks search for food. | Animals receive nitrogen by eating plants. |
| | GPT-2 | n/a | Food causes a person to be hungry and a person to eat. | Eating at home will not lead to weight gain. |
3.1 Commonsense Knowledge Graphs
ConceptNet is a multilingual commonsense knowledge graph. Its nodes are primarily lexical and connect to each other with 34 relations. Its data is largely derived from the crowdsourced Open Mind Common Sense (OMCS) corpus, and complemented with knowledge from other resources, like WordNet.
ATOMIC  is a commonsense knowledge graph that expresses pre- and post-states for events and their participants in a lexical form with nine relations. Its base events are collected from a variety of corpora, while the data for the events is collected by crowdsourcing.
GLUCOSE  contains causal knowledge through 10 relations about events, states, motivations, and emotions. The knowledge in GLUCOSE is crowdsourced based on semi-automatic templates, and generalized from individual stories to more abstract rules.
WebChild  is a commonsense knowledge graph whose nodes and relations are disambiguated as WordNet senses. It captures 20 main relations, grouped in four categories. WebChild has been extracted automatically from Web information, and canonicalized in a post-processing step.
Quasimodo  contains commonsense knowledge about object properties, human behavior, and general concepts. Its nodes and relations are initially lexical and extracted automatically from search logs and forums, after which a notable subset of them has been clustered into WordNet domains.
SenticNet  is a knowledge base with conceptual and affective knowledge, which is extracted from text and aggregated automatically into higher-level primitives.
3.2 Common Knowledge Graphs and Ontologies
Wikidata is a general-domain knowledge graph, tightly coupled with Wikipedia, that describes notable entities. Its nodes and relations are disambiguated as Qnodes. The content of Wikidata is collaboratively created by humans, and complemented with knowledge from other existing sources. Given the vast number of statements in Wikidata and its sizable set of over 7 thousand relations, we consider its Wikidata-CS commonsense subset, as extracted in prior work.
YAGO is a general-purpose knowledge graph, whose nodes and relations are disambiguated entities. The knowledge in YAGO is extracted automatically from Wikipedia, and consolidated with knowledge from other sources, like Schema.org.
DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering)  is an upper level ontology that captures the ontological categories underlying natural language and human common sense with disambiguated concepts and relations. It has been created manually by experts.
SUMO (Suggested Upper Merged Ontology)  is an ontology of upper-level disambiguated concepts and their relations. It has been created manually by experts.
3.3 Lexical Resources
Roget  is a manually-created thesaurus that contains synonyms and antonyms for English words.
FrameNet  is a lexical resource that formalizes the frame semantics theory: meanings are mostly understood within a frame of an event and its participants that fulfill roles in that frame. FrameNet was created manually by experts.
MetaNet  is a repository of conceptual frames, as well as their relations which often express metaphors. It has been created manually.
VerbNet  is a resource that describes syntactic and semantic patterns of verbs, and organizes them into verb classes. It has been created manually by experts.
3.4 Visual Commonsense Sources
Visual Genome  contains annotations of concepts and their relations in a collection of images. The image descriptions are manually written by crowd workers, while their concepts are mapped automatically to WordNet senses and revised by crowd workers.
3.5 Corpora and Language Models
GenericsKB  contains self-contained generic facts represented as naturally occurring sentences. The sentences have been extracted from three existing corpora, filtered by handwritten rules, and scored with a BERT-based classifier.
As apparent in this section, the commonsense sources are based on a wide range of representation principles and have been created with different construction methods. Through the example scenarios of food and eating (Table 1), we show that they have notable overlap in terms of their covered (typically well-known) concepts. At the same time, the types of knowledge covered differ across sources: some sources provide truisms, such as feeding is done with food, while others speculate on usual properties of food, such as its capability to go rotten or to often be on a plate. Furthermore, we observe that the same or similar relations tend to have different names across sources (compare type of to subclass of or is; or has quality in Wikidata to cook faster in Quasimodo).
These distinctions make the integration of these sources, and the understanding of their coverage and gaps, very challenging. In order to integrate the knowledge in these sources, we next propose a consolidation of their relations into a common set of dimensions.
| Dimension | Relations mapped to it (excerpt) |
|---|---|
| similarity | DefinedAs; Synonym (RG); said to be the same as |
| distinctness | Antonym; Antonym (RG); different from; DistinctFrom; antonym (WN); opposite of |
| part-whole | HasA; physicalPartOf; meronym (WN); has part; MadeOf; memberOf; holonym (WN); member of |
| utility | UsedFor; hassynsetmember; using (FN); used by |
| temporal | xReact; HasPrerequisite; prev; causative_of (FN); has cause |
| relational-other | RelatedTo; EtymologicallyRelatedTo; agent; requires (FN); field of this occupation; health specialty |
4 Dimensions of Commonsense Knowledge
In the previous section, we surveyed 20 representative commonsense sources from five categories: commonsense KGs, common KGs, lexical, visual sources, and corpora and language models. A key contribution of this paper is a manual categorization (by the authors) of the kind of knowledge expressed by the relations in these sources into 13 dimensions.
Table 2 shows the correspondence of each relation in these analyzed sources to our dimensions. An example for each of the dimensions from different sources is shown in Table 3. We next describe each dimension in turn.
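To make the categorization concrete, such a mapping can be realized as a simple lookup over (source, relation) pairs. The entries below are a small illustrative excerpt, not the complete mapping of Table 2, and the relation spellings are assumptions about how edges are named in each source.

```python
# Sketch of a relation-to-dimension lookup. The excerpt below is
# illustrative (hypothetical relation spellings), not the full mapping.
RELATION_TO_DIMENSION = {
    ("ConceptNet", "/r/Synonym"): "similarity",
    ("ConceptNet", "/r/Antonym"): "distinctness",
    ("ConceptNet", "/r/AtLocation"): "spatial",
    ("ConceptNet", "/r/UsedFor"): "utility",
    ("WordNet", "hyponym"): "taxonomic",
    ("WordNet", "meronym"): "part-whole",
    ("ATOMIC", "xWant"): "desire/goal",
    ("ATOMIC", "xAttr"): "quality",
    ("Wikidata", "different from"): "distinctness",
}

def dimension_of(source: str, relation: str) -> str:
    """Map an edge's relation to one of the 13 dimensions;
    unknown or generic relations fall back to 'relational-other'."""
    return RELATION_TO_DIMENSION.get((source, relation), "relational-other")
```

A lookup of this shape also makes the fallback behavior explicit: any relation without a dedicated dimension lands in the generic relational-other bucket, mirroring how vague relations like RelatedTo are treated.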
Lexical. Many data sources leverage the vocabulary of a language or the lexicon in their relations. This includes relationships such as plural forms of nouns or past tenses of verbs. Lexical knowledge also covers substring information. ConceptNet, for example, includes a relationship called DerivedFrom, described as capturing cases where a word or phrase appears within another term and contributes to that term’s meaning. Lexical knowledge also formalizes the relation between a concept and its expression in a language, e.g., denoted through the label relation in Wikidata.
Similarity. Most data sources include the notion of synonymy between expressions, include definitions of terms, or cover a broader notion of general similarity.
ConceptNet has all three subcategories – for instance, regarding similarity, it establishes that wholesome and organic food are similar notions, while eating is defined as the process of taking in food. WebChild also captures similarity between WordNet concepts, while WordNet, Wikidata, and Roget focus on synonymy. For instance, Roget declares that food and edibles are synonyms, while Wikidata expresses that food is said to be the same as nutriment.
Distinctness. Complementary to similarity, most data sources have notions of some kind of distinguishability. Most commonly, this is formalized as antonymy, where words have an opposition relationship between them, i.e., they have an inherently incompatible relationship. For example, both Roget and ConceptNet consider hot and cold to be antonyms, as these are two exclusive temperature states of objects. FrameNet defines an Excludes relation to indicate that two roles of a frame cannot be simultaneously filled in a given situation. For instance, in the Placing frame, an event can be brought about either by a cause event or by an intentional agent, but not both. Weaker forms of distinctness are defined by Wikidata and ConceptNet, for concepts that might be mistaken as synonyms. For example, Wikidata states that food safety is different from food security, while ConceptNet distinguishes food from drinks.
| Dimension | Example | Source |
|---|---|---|
| lexical | derivationally related form: nutrient | WordNet |
| | etymologically related: fodder | ConceptNet |
| | derived term: foodie | ConceptNet |
| similarity | said to be the same as: nutriment | Wikidata |
| | similar to: wholesome – organic | ConceptNet |
| distinctness | opposite of: non-food item | Wikidata |
| | distinct from: drink | ConceptNet |
| | different from: food safety – food security | Wikidata |
| taxonomic | hyponym: comfort food | WordNet |
| | subclass of: disposable product | Wikidata |
| part-whole | things with food: minibar | ConceptNet |
| | is part of: life | COMET |
| | material used: food ingredient | Wikidata |
| spatial | is located at: pantry | ConceptNet |
| | is located at: a store | ConceptNet |
| | location: toaster – kitchen | Wikidata |
| | located near: plate | Visual Genome |
| | located near: table | Visual Genome |
| creation | is created by: cook | COMET |
| | is created by: plant | COMET |
| utility | used by: organism | Wikidata |
| | used for: pleasure | ConceptNet |
| | used for: sustain life | COMET |
| | used for: nourishment | ConceptNet |
| | capable of: cost money | ConceptNet |
| | capable of: go rotten | ConceptNet |
| | is capable of: taste good | COMET |
| goal/desire | xWant: watch movie together – get some food | ATOMIC |
| | desires: regular access to food | ConceptNet |
| | not desires: food poisoning | ConceptNet |
| | causes desire to: eat | ConceptNet |
| | xIntent: eats food – quit feeling hungry | ATOMIC |
| | motivated by: cook a meal | ConceptNet |
| | is motivated by: you be hungry | COMET |
| quality | xAttr: makes food – creative | ATOMIC |
| | has quality: shelf life | Wikidata |
| | has the property: tasty | COMET |
| comparative | healthier: home cooking – fast food | WebChild |
| temporal | has first subevent: cooking | ConceptNet |
| | starts with: open your mouth | COMET |
| | has effect: food allergy | Wikidata |
| | causes: you get full | COMET |
| relational-other | related to: refrigerator | ConceptNet |
| | related to: cereal | ConceptNet |
| | field of work: food bank – food assistance | Wikidata |
| | main subject: cuisine – food product | Wikidata |
Taxonomic. Most data sources include a classification arrangement in which some objects are placed into more general and more specific groupings with inheritance relations. When those groupings are ordered categories based on generality, this captures the notion of hyponymy, indicating a subcategory relationship. Hyponymy blurs the distinction between the relationships subclass / IsA (intended for two classes) and InstanceOf (intended as a relation between an instance and a class). For instance, Wikidata states that a sandwich wrap is street food, or that food is a disposable product. WordNet has information that beverage and comfort food are hyponyms of food. While this dimension generally focuses on concepts (nouns), it also includes a specialization relation for verbs. Here, the MannerOf relation in ConceptNet states that wheezing is a manner of breathing.
Part-whole. Many data sources include a notion of being a part of or a member of something. Part-whole knowledge can be transitive, such as that of geographic containment, exemplified by New York City being a part of New York State, which is in turn part of the United States. Other part-of notions, such as member-of, are not necessarily transitive. A third category of part-whole knowledge is expressed with the material or the building blocks of an object, such as food being made of food ingredients. A useful distinction between these three notions of part-whole: physical part of (sunroof – car), member of (musician – duet), and substance of (steel – boiler), is provided by WebChild. In addition, the importance of this commonsense dimension is shown by HasPartKB, which is an entire resource dedicated to part-whole relations.
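The transitivity noted above for physical part-of (but not member-of) can be sketched as a simple closure computation. The function and edge list below are illustrative, not taken from any of the surveyed resources.

```python
# Minimal sketch: transitive closure over physical part-of edges, so that
# "NYC part of New York State" and "New York State part of the United
# States" together imply "NYC part of the United States". Member-of edges
# are deliberately excluded, since member-of is not transitive.
from collections import defaultdict

def partof_closure(edges):
    """edges: iterable of (part, whole) physical part-of pairs.
    Returns the set of all (part, whole) pairs implied by transitivity."""
    graph = defaultdict(set)
    for part, whole in edges:
        graph[part].add(whole)
    closure = set()
    for start in list(graph):           # snapshot keys before traversal
        stack = list(graph[start])
        while stack:
            whole = stack.pop()
            if (start, whole) not in closure:
                closure.add((start, whole))
                stack.extend(graph[whole])  # follow wholes of wholes
    return closure

edges = [("New York City", "New York State"),
         ("New York State", "United States")]
# ("New York City", "United States") is derived transitively.
```

A consolidated resource would need to track which part-whole subrelation each edge uses, since applying such a closure to member-of or substance-of edges would produce incorrect inferences.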
Spatial. Spatial relations describe terms relating to or occupying space. This may entail indicating a usual location of a concept, as in the location property in Wikidata or the AtLocation relation in ConceptNet. ConceptNet expresses locations for geographic entities, for example Boston is at location Massachusetts, as well as for things that can contain things: butter is at location refrigerator. Similarly to the latter case, Wikidata includes an example that toasters are located in kitchens. A weaker spatial relation is one of spatial proximity in WebChild or ConceptNet, specifying that, e.g., bikes are located near roads. While Visual Genome does not explicitly have a spatial relation, concepts occurring in the same image region can be represented with the LocatedNear relation. Example statements include food being located near a plate or a table.
Creation. This dimension describes the process or the agent that brought something into existence. ConceptNet gives an example that a cake is created by the bake process, COMET has information that food is created from plants, while Wikidata states that rifle factories create shotguns. Table 2 reveals that no other source has creation information.
Utility. This dimension covers a notion of fitness or usefulness of objects for some purpose.
ConceptNet’s relation UsedFor expresses knowledge that ‘the purpose of A is B’, with an example of food being used for pleasure or nourishment.
Wikidata has several similar relations: use, used by, and uses, which can express that platter is used for food presentation, or food is used by organisms.
ConceptNet includes the notion of CapableOf, described as ‘A is capable of B if A can typically do B’, like food being capable of going rotten, or knives being capable of cutting. Another related notion is that of receiving an action: a button may receive the push action. While a button does not have the sole purpose of being pushed, it is capable of receiving that action, and by inference, it may respond to the action.
Desire or goal. This dimension covers knowledge about agent desires or goals. An agent may want to have something or wish for something to happen. The agent typically has certain goals, aims, and/or plans, that may motivate or explain those desires.
The relation Desires in ConceptNet may indicate, e.g., that a person desires regular access to food. Its negated version, NotDesires expresses that a person does not desire poisoned food. ATOMIC has two relations: xWant and oWant, to indicate the desires of an agent or other agents in a given situation. For instance, when people watch a movie together, they want to get some food.
Regarding goals, ConceptNet includes the MotivatedByGoal and ObstructedBy relations to indicate the motivation and the constraint for a certain action. For instance, ConceptNet indicates that one’s sleep is obstructed by noise, while COMET’s extension of ConceptNet posits that people cook a meal because they are hungry.
Quality. Commonsense sources typically describe attributes of an agent or qualities related to an object.
For example, ConceptNet and COMET include the relation HasProperty, to express knowledge like ice having the property cold and food having the property tasty.
ATOMIC uses xAttr to indicate that, for example, the person that cooks food often has the attribute hungry or creative. WebChild and Wikidata both provide more specific qualities, such as taste, temperature, shape, or color. For instance, WebChild would specify the plant color as green.
Comparative. WebChild performs comparison of objects based on relative values for their attributes. Example comparative relations in WebChild are: healthier than (home cooking – fast food), faster than (car – bike), and larger than (lion – hyena). Notably, no other source describes comparative knowledge explicitly.
Temporal. Most sources have notions of time that support temporal ordering and/or capture relations indicating that one thing is a prerequisite for another, or that one thing has a particular effect.
ConceptNet, for example, expresses that the first event of eating may be cooking, while the last one could be getting rid of the containers. COMET states that eating starts with opening one’s mouth. More strongly, the temporal relations often indicate relative ordering of two events, through relations of causation and effects, such as food potentially causing allergy or indigestion. Such causal knowledge is found in ATOMIC, ConceptNet, COMET, WebChild, and Wikidata.
Relational-other. Conceptual and context-related relationships are often underspecified. On the one hand, some sources increasingly capture descriptions of the circumstances that form the setting for a statement, event, or idea. ConceptNet has a single relation HasContext for this, while Wikidata has more concrete contextual relations, such as field of this occupation, depicts, and health specialty. This allows Wikidata to express that the main subject of a cuisine is a food product, and that the field of work of food banks is food assistance.
On the other hand, most of the knowledge in ConceptNet belongs to a generic relation called RelatedTo that may be used to capture a relatively vague semantic connection between two concepts, such as food being related to refrigerator or cereal.
Our organization of existing relations into 13 dimensions provides a unified framework to reorganize and consolidate these sources. Here, we discuss two nuances of our process. First, we placed the negative statements (marked in Table 2) in the same dimension as the positive ones, as they cover the same knowledge type, despite having a different polarity and, arguably, purpose. Following a similar line of reasoning, we also placed inverse relations, such as used for and uses, in the same dimension. Second, we recognize that the underlying data may not always be clearly placed in one of these dimensions. For instance, the relation AtLocation, which intuitively should belong to the spatial category, contains some statements that express part-whole knowledge.
Seven of the sources covered in the previous section: ConceptNet, ATOMIC, Visual Genome, WordNet, Roget, Wikidata-CS, and FrameNet, have been integrated together in the Commonsense Knowledge Graph (CSKG). We start with CSKG and apply our dimension classification (Section 4) to its sources, under the assumption that each of their edge relations can be mapped unambiguously to one of the dimensions.
This enrichment of the CSKG graph allows us to study the commonsense knowledge dimensions from multiple novel perspectives.
We investigate the following questions:
How well is each dimension covered in the current sources? Here we compute the number of edges for each dimension across sources.
Is knowledge redundant across sources?
In experiment 2, we use the dimensions to quantify overlap between sources with respect to individual edges.
How do the dimensions compare to knowledge clusters derived from language models? In experiment 3, we contrast our dimension-based clusters with language model-based clusters, to understand the similarities and differences in their focus.
What is the impact of each dimension for reasoning on QA tasks?
In experiment 4, each of the dimensions is used to select a subset of the available knowledge in CSKG. The selected knowledge is then used to pretrain a RoBERTa language model, which is applied to answer commonsense questions in a zero-shot manner.
In this section, we formulate and run suitable studies for each of the four questions, and reflect on the results.
5.1 Experiment 1: How well is each dimension covered in the current sources?
We use the CSKG graph enriched with edge dimensions to compute source coverage with respect to each dimension. The coverage of each source, formalized as a count of the number of edges per dimension, is presented in Table 4.
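The per-source coverage can be computed with a straightforward aggregation over the annotated edges. The sketch below uses a handful of hypothetical edge records in place of the full CSKG:

```python
from collections import Counter

# Minimal sketch: count edges per (source, dimension) pair to produce a
# coverage table in the spirit of Table 4. `edges` stands in for the
# dimension-annotated CSKG; the records here are illustrative.
edges = [
    {"source": "ConceptNet", "dimension": "taxonomic"},
    {"source": "ConceptNet", "dimension": "relational-other"},
    {"source": "ConceptNet", "dimension": "relational-other"},
    {"source": "Roget", "dimension": "similarity"},
    {"source": "ATOMIC", "dimension": "desire/goal"},
]

coverage = Counter((e["source"], e["dimension"]) for e in edges)
for (source, dimension), n in sorted(coverage.items()):
    print(f"{source:12s} {dimension:18s} {n}")
```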
We observe several trends in this table. First, there is a notable imbalance in the number of sources per dimension. Comparative knowledge and creation information are very rare and are described by only one or two sources, whereas taxonomic, temporal, and similarity knowledge are much more common and are captured by most sources. Second, some of the dimensions, like creation or part-whole, are represented with relatively few edges, whereas similarity and taxonomic knowledge generally have a much larger number of edges. The exception to the former is the large number of part-whole statements in WebChild, which is due to the fact that WebChild is automatically extracted, resulting in many duplicates and noisy information. Third, we see that some sources, like ConceptNet, FrameNet, and Wikidata-CS, aim for breadth and cover most dimensions. Others, like Roget and ATOMIC, have a narrow focus on specific dimensions: primarily desires/goals and temporal knowledge in ATOMIC, and only knowledge on similarity and distinctness in Roget. Yet, the narrow focus generally coincides with much depth, as both sources have many edges for the small set of dimensions that they cover. FrameNet, despite having a broad focus, has a small number of edges for each dimension due to its limited coverage of lexical units. Again, WebChild is a notable outlier with a large number of automatically extracted statements for most dimensions. Finally, we observe different ratios between ‘strong’ and ‘weak’ semantic relations across sources. Most of ConceptNet’s knowledge falls under the generic relational-other category, whereas only a small portion of Wikidata-CS belongs to the same dimension; most of Wikidata-CS is taxonomic knowledge.
| Source pair | Relation-based overlap | Dimension-based overlap |
|---|---|---|
| CN – RG | 57,635 (1.23%) | 73,992 (1.60%) |
| CN – WD | 2,386 (0.07%) | 2,623 (0.08%) |
| CN – WN | 86,006 (2.14%) | 97,946 (2.60%) |
| RG – WD | 299 (0.02%) | 333 (0.02%) |
| RG – WN | 75,025 (3.55%) | 75,025 (3.93%) |
| WD – WN | 1,697 (0.19%) | 1,704 (0.25%) |
5.2 Experiment 2: Is knowledge redundant across sources?
Our analysis so far reveals that most dimensions are covered by more than one source. This leads us to the next question: how often is a statement found in multiple sources?
Computing edge overlap between sources is conditioned on identity mapping between their nodes and relations.
While CSKG provides such identity mappings between some of its nodes, this cannot be expected to be complete. We align the edges as follows.
The nodes across sources are naively compared through their labels.
As a dimension-based enhancement, we abstract each relation further by mapping it to our dimensions, e.g., transforming (food, /r/Synonym, dish) to (food, similarity, dish). This dimension-based transformation allows for more flexible matching within a dimension, for instance, enabling similarity and synonymy statements to be compared for equivalence, since both (food, /r/Synonym, dish) and (food, /r/SimilarTo, dish) would be normalized to (food, similarity, dish).
We apply the relation-based and dimension-based variants to compute overlap between four sources: ConceptNet, Roget, Wikidata, and WordNet, in terms of each dimension. Here we do not consider ATOMIC or FrameNet, as their edges can be expected to have extremely low lexical overlap with the other sources. The overlap is computed as a Jaccard score between the number of shared triples between two sources and the union of their triples. The obtained scores are given in Table 5.
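A minimal sketch of the two overlap variants, assuming toy triple sets, naive label comparison, and a partial relation-to-dimension map; the dimension-based abstraction lets a synonymy edge in one source match a similarity edge in another:

```python
def to_dimension_triples(triples, relation_to_dimension):
    """Abstract each (head, relation, tail) to (head, dimension, tail)."""
    return {(h, relation_to_dimension.get(r, "relational-other"), t)
            for h, r, t in triples}

def jaccard(a, b):
    """Jaccard score: shared triples over the union of both triple sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical mini-sources; node labels are naively compared as strings.
dim = {"/r/Synonym": "similarity", "/r/SimilarTo": "similarity"}
cn = {("food", "/r/SimilarTo", "dish"), ("food", "/r/IsA", "substance")}
rg = {("food", "/r/Synonym", "dish")}

# Relation-based overlap misses the match; dimension-based overlap finds it.
print(jaccard(cn, rg))  # 0.0
print(jaccard(to_dimension_triples(cn, dim),
              to_dimension_triples(rg, dim)))  # 0.5
```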
Next, we inspect the overlap between these sources per dimension: will the edges that correspond to more commonly found dimensions (e.g., part-whole, see Experiment 1) occur more often in multiple sources? We provide insight into this question in Table 6. Primarily, this table reveals that there is very little edge overlap across sources. As hypothesized, most of the shared edges belong to dimensions that are common in many commonsense sources, such as taxonomic, similarity, and part-whole. The highest Jaccard score is obtained on the taxonomic knowledge between ConceptNet and WordNet, followed by similarity knowledge in ConceptNet-Roget and Roget-WordNet.
Wikidata and ConceptNet share edges that belong to a number of other dimensions, including distinctness, similarity, and rel-other.
5.3 Experiment 3: How do the dimensions of the edges compare to their language model encoding?
Next, we investigate how the information captured by our dimensions relates to the encoding of edges by state-of-the-art Transformer-based language models, like BERT or RoBERTa.
The two approaches for computing clusters, based on our dimensions and based on Transformer embeddings, can now be compared in terms of their agreement. We use the adjusted rand index (ARI) metric to measure the agreement.
We pick a random sample of 5,000 edges. To understand the information encoded by RoBERTa, we visualize its k-means clusters with UMAP (Figure 1).
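The agreement measure can be computed without external libraries; below is a self-contained adjusted Rand index, applied to a toy pair of clusterings (dimension labels vs. hypothetical RoBERTa cluster ids):

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two clusterings of the same items, corrected for
    chance (ARI). Identical clusterings score 1; random ones hover near 0."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case: no variation to correct for
    return (sum_ij - expected) / (max_index - expected)

# Toy example: dimension labels perfectly recovered by two cluster ids.
dims = ["similarity", "similarity", "temporal", "temporal"]
roberta_clusters = [0, 0, 1, 1]
print(adjusted_rand_index(dims, roberta_clusters))  # 1.0
```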
Curiously, certain clusters are clearly delineated, while others are not. For instance, cluster 5 has little overlap with the other clusters. Looking into the contents of this cluster, we observe that it is largely dominated by distinctness information: 92% (360 out of 390) of its edges belong to this dimension, mostly expressed through the /r/Antonym relation. Clusters 4, 7, and 8 are largely dominated by similarity, while clusters 1 and 6 are largely split between temporal (46%) and desire/goal (36%) edges. At the same time, we observe considerable overlap among clusters 0, 2, 9, 10, 11, and 12. These clusters are dominated by lexical and relational-other edges – e.g., around half of all edges in clusters 0 and 9 belong to the category relational-other. The node frequency distributions reveal that cluster 1 describes positive emotions, as its most frequent node is /c/en/happy; nodes in cluster 5 are often numbers, like rg:en_twenty-eighth; and the nodes in cluster 9 describe concepts from natural sciences, the most connected node being /c/en/zoology.
| RoBERTa cluster | Top-3 dimensions (Jaccard) |
|---|---|
| 0 | lexical (0.205), rel-other (0.121), taxonomic (0.060) |
| 1 | desire (0.258), temporal (0.202), quality (0.087) |
| 2 | lexical (0.133), rel-other (0.122), taxonomic (0.075) |
| 3 | spatial (0.119), lexical (0.061), quality (0.053) |
| 4 | similarity (0.21), quality (0.015), lexical (0.008) |
| 5 | distinctness (0.916), lexical (0.018), taxonomic (0.003) |
| 6 | temporal (0.322), desire (0.283), quality (0.054) |
| 7 | similarity (0.295), quality (0.009), taxonomic (0.005) |
| 8 | similarity (0.452), lexical (0.004), taxonomic (0.003) |
| 9 | rel-other (0.143), lexical (0.085), taxonomic (0.081) |
| 10 | rel-other (0.169), taxonomic (0.015), quality (0.007) |
| 11 | rel-other (0.11), quality (0.068), taxonomic (0.066) |
| 12 | rel-other (0.183), taxonomic (0.059), spatial (0.053) |
In Figure 2, we visualize the same set of edges, only this time we color each according to their dimension. In accordance with the relatively low adjusted Rand index score, we observe that the clusters are mostly not well-distinguished from one another. We look for correspondences between the RoBERTa clusters in Figure 1 and the dimension-based clusters in Figure 2, by computing the Jaccard score between the edges that constitute each pair of their clusters. The 10 cluster pairs with the highest Jaccard scores are shown in Table 7. We observe the highest correspondence for distinctness with cluster 5 (Jaccard score of 0.92) and similarity with cluster 8 (score 0.45). The overlapping clusters for desire and temporal knowledge both map relatively strongly to clusters 1 and 6, which appear nearby in the RoBERTa clustering as well. We also observe relatively high scores for the cluster pairs 4-similarity, 7-similarity, and 0-lexical, all of which confirm our prior analysis of Figure 1.
We show the top-3 highest-scored dimensions for each of the automatically-computed clusters in Table 8. Here, we observe that some dimensions, like creation, utility, and part-whole do not fall within the top-3 dimensions for any of the RoBERTa clusters. This is likely due to the lower number of edges with these dimensions, as well as their dispersion across many RoBERTa clusters.
To investigate further, we select the two CSKG nodes we considered earlier in Table 3: /c/en/food and /c/en/eat, and visualize all their edges according to their dimension. The total set of 2,661 edges largely belongs to the dimension relational-other (1,553), followed by temporal (319 edges), and desire/goal (228 edges), whereas no edge belongs to the creation dimension. Thus, most of the well-specified edges about nutritional concepts express either temporal information about the process, or knowledge about desires/goals relating to nutrition.
Within each dimension, a small number of relations dominate. Most relational-other edges are expressed with RelatedTo (1,488 out of 1,553). Temporal knowledge is split across multiple relations, primarily HasLastSubevent (113 edges), HasPrerequisite (69), and HasSubevent (60). Desire/goal knowledge is divided among at:xWant (78 edges), MotivatedByGoal (47), and at:xIntent (47).
Besides the two seed nodes (/c/en/food and /c/en/eat), the frequency distribution of nodes in a cluster reveals other prominent nodes. Naturally, the spatial dimension includes /c/en/plate (with an edge degree of 3), the temporal cluster includes /c/en/diminish_own_hunger (degree of 4), and the distinctness cluster has 7 edges for /c/en/drink.
5.4 Experiment 4: What is the impact of each dimension for reasoning on QA tasks?
Experiment 3 revealed overlaps between the information captured by our dimensions and that captured by language models. In our fourth experiment, we experiment with enhancing language models with knowledge belonging to individual dimensions, in order to examine the effect of different dimensions of knowledge on commonsense reasoning tasks.
We adopt the method proposed by  to pretrain state-of-the-art language models and conduct zero-shot evaluation on two commonsense question answering tasks. According to this method, we first transform ConceptNet, WordNet, Wikidata, ATOMIC, and Visual Genome into synthetic QA sets. We use templates to map each triple into a QA pair and apply random sampling with heuristic-based filtering to collect two distractors for every QA pair. We group synthetic QA pairs based on their dimension, resulting in 12 dimension-based QA buckets in total. Within each dimension, the QA data is split into training and development sets. For ATOMIC, we adopt its original split to partition the data. For the other knowledge graphs, we partitioned 95% of the data into the training set and the remaining 5% into the development set, following . It is worth noting that  only selected 14 relations from ConceptNet, WordNet, Wikidata, and Visual Genome, whereas we include all relations except RelatedTo. The statistics for the synthetic QA sets are shown in Table 9. We can see that the distribution of knowledge across dimensions is fairly skewed: creation has very few questions, while taxonomic and temporal knowledge are the most numerous. Our experiments in this section will reveal whether the amount of available knowledge affects downstream task performance.
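The triple-to-QA transformation can be sketched as follows; the templates and candidate pool are hypothetical, and the heuristic-based filtering is reduced to a single check (distractors must differ from the correct answer):

```python
import random

# Hedged sketch of turning a knowledge graph triple into a synthetic
# multiple-choice QA pair with two randomly sampled distractors.
# The templates below are illustrative, not the paper's actual templates.
TEMPLATES = {
    "/r/UsedFor": "What is {head} used for?",
    "/r/AtLocation": "Where would you find {head}?",
}

def triple_to_qa(head, relation, tail, candidate_pool, rng):
    question = TEMPLATES[relation].format(head=head)
    # Minimal heuristic filter: distractors must differ from the answer.
    distractors = rng.sample([c for c in candidate_pool if c != tail], 2)
    options = distractors + [tail]
    rng.shuffle(options)
    return {"question": question, "options": options, "answer": tail}

rng = random.Random(0)
pool = ["cutting", "sleeping", "cooling food", "reading"]
qa = triple_to_qa("knife", "/r/UsedFor", "cutting", pool, rng)
print(qa["question"], qa["options"])
```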
We pretrain the RoBERTa-large  model on each of the dimensions using the corresponding synthetic QA set. We use RoBERTa with a marginal ranking objective, as this is the best combination according to .
We use the same set of hyper-parameters as in , except for the creation dimension. Specifically, we train our models for epoch using learning rate , batch size and margin . For the creation dimension, since the number of samples is much smaller, we train the model for epochs while keeping the other hyper-parameters fixed.
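The marginal ranking objective can be illustrated with a small sketch; the scores stand in for the language model's plausibility scores of each answer option, and the margin value is an assumption:

```python
# Minimal sketch of the marginal ranking objective: the correct answer's
# score should exceed each distractor's score by at least a margin.
# Scores are placeholders for LM plausibility scores; margin is assumed.
def margin_ranking_loss(correct_score, distractor_scores, margin=1.0):
    """Sum of hinge losses over distractors."""
    return sum(max(0.0, margin - correct_score + d)
               for d in distractor_scores)

# Correct answer scored well above both distractors: zero loss.
print(margin_ranking_loss(3.0, [1.5, 0.5]))  # 0.0
# Correct answer barely beats the distractor: positive loss.
print(margin_ranking_loss(1.0, [0.8]))  # 0.8
```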
We evaluate our models on two tasks: the CommonsenseQA task , in which the model is asked to choose the correct answer from five options given only the question, and SocialIQA task , in which the model chooses the correct answer from three options given a question and a brief context.
The results from our experiments are shown in Table 10.
Overall, we can see that with the additional pretraining on transformed knowledge graphs, the models are able to outperform the no-knowledge baseline on all dimensions. However, the variance of the improvement across dimensions is relatively large, revealing that certain dimensions are more relevant for downstream tasks than others. For example, although the training set size of the lexical dimension exceeds 107K, its performance gain on both tasks is limited. We attribute this to the language model having already learned most lexical knowledge during pretraining on unstructured text corpora, leaving little to gain from additional training. While the quality dimension has a similar training set size to the lexical dimension, the model benefits from it by a large margin on both tasks: 20.7 and 12.7 absolute points, respectively. This finding suggests that quality knowledge is novel and useful, as it may not be easy to learn from unstructured text.
Also, we note that downstream tasks benefit more from the knowledge dimensions that align with their question types. For example, each question in SIQA corresponds to an ATOMIC relation, and requires knowledge primarily about the order of events, personal attributes, and agent desires. Consequently, pretraining on quality, temporal, and desire/goal knowledge provides the model with the largest gain on the SIQA task. The results of the temporal dimension are even higher than training on the entire set of questions, suggesting that certain knowledge dimensions that are not related to SIQA may even lead to a decline in model performance. For the CSQA task, since it is derived from the broad set of knowledge dimensions covered in ConceptNet, we expect that many (if not all) of these dimensions would help performance. Accordingly, we observe large gains with many of the knowledge dimensions (+15%), with the utility dimension yielding the best performance (+22.4%), even slightly better than that of the entire set.
To better understand the impact of every dimension of knowledge on different questions, we further break down the performance of the models on every question type. Specifically, for CSQA, we classify questions based on ConceptNet relations between the correct answer and the question concept, as the model needs to reason over such relations to be able to answer the question. For SIQA, since the questions are initially generated using a set of templates based on ATOMIC relations, we try to reverse-engineer the process by manually defining a mapping from question format to ATOMIC relations. Using this method, we are able to successfully classify more than 99% of questions in SIQA dev set. Then we compute the averaged accuracy over 3 seeds for every question type for models trained on each dimension. The results for CSQA are shown in Figure 4 and results for SIQA are shown in Figure 5.
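The reverse-engineered mapping from SIQA question formats to ATOMIC relations can be sketched with surface patterns; the patterns below are illustrative stand-ins, not the actual SIQA templates:

```python
import re

# Hedged sketch: classify SIQA questions into ATOMIC relations by matching
# their surface form. The patterns are hypothetical examples; the real
# mapping was defined manually from the SIQA generation templates.
QUESTION_PATTERNS = [
    (re.compile(r"What will .* want to do next\?"), "xWant"),
    (re.compile(r"Why did .* do this\?"), "xIntent"),
    (re.compile(r"How would you describe .*\?"), "xAttr"),
    (re.compile(r"What does .* need to do before this\?"), "xNeed"),
]

def classify_question(question):
    for pattern, relation in QUESTION_PATTERNS:
        if pattern.fullmatch(question):
            return relation
    return None  # unmatched questions (under 1% in the paper's mapping)

print(classify_question("What will Robin want to do next?"))  # xWant
```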
For some question types of CSQA, (one of) the largest improvements is achieved on the corresponding knowledge dimension, for example, temporal on Causes and desire/goal on Desires. However, in other cases, the accuracy boost brought about by the corresponding knowledge dimension is significantly lower than that of other dimensions, for example, distinctness on Antonym compared to utility. This may signal that the knowledge represented within these dimensions is not clearly separated. For SIQA, the results show that the corresponding knowledge dimension is more clearly helpful for most question types: desire/goal on xWant and xIntent, quality on xAttr, and temporal on xNeed, oReact, and oEffect. This is especially visible for xIntent and xNeed, where very little gain is observed for most knowledge dimensions except for the corresponding dimension, suggesting that the alignment between the questions and knowledge dimensions is important for the model’s success. We note that a similar finding on the alignment between knowledge and the task has been reported in the original paper ; yet, the dimensions allow us to validate this claim more precisely.
Finally, to verify our hypothesis that certain dimensions of knowledge are already learned by state-of-the-art language models to a large extent, while others are not, we directly evaluate the baseline LM on the synthetic QA sets for each dimension. The results are shown in Table 11. As expected, even without any training, the model already achieves a very high accuracy (over 90%) on the lexical dimension, thus it could not receive much training signal from this dimension. On the other hand, the accuracies on the quality, temporal, and desire/goal dimensions are significantly lower. This is mostly because questions from ATOMIC make up a large portion of these dimensions. As reported in , the questions from ATOMIC are more challenging than those created from other knowledge resources. We note that the accuracy for relational-other is the lowest among all dimensions. We hypothesize that this is because this dimension is noisier than others and its knowledge is less likely to be found in unstructured text. We leave further investigation of this issue for future research.
In summary, we observe that certain dimensions of knowledge are very beneficial and novel for language models, allowing them to improve their performance on downstream reasoning tasks. Other dimensions, like lexical knowledge, are almost entirely redundant, as the language models have already acquired this knowledge during their initial training. The exact contribution for each dimension depends on the knowledge required by the task at hand. We discuss the implications of the obtained results further in Section 7.
6 Related Work
6.1 Epistemic Foundations
World knowledge is what people learn and generalize from physical and social experiences, distilling mental representations out of the most significant aspects of everyday life.
From Aristotle’s theory of categories  to Brentano’s empirical psychology , deriving knowledge dimensions from empirical observations, as opposed to from abstract criteria , is a fundamental epistemic approach that contributed to the birth of Cognitive Science as a discipline , besides serving as a reference framework for our current investigation.
In this article, in fact, we neither propose nor adopt any a priori principle to define what should be included in a common sense knowledge graph; rather, we analyze multiple knowledge graphs and, supported by empirical methods and experimental validations, elicit their most salient conceptual structures.
In the history of general knowledge base systems, the difficulty of characterizing commonsense has been a driver, rather than an obstacle. For instance, Cyc , the most monumental effort to construct an axiomatic theory of common sense knowledge, has been actively growing for almost forty years. At present, the Cyc knowledge base comprises around 1.5 million general concepts, and 25 million rules and assertions about these general concepts. Different domain-specific extensions of Cyc exist, funded by industry and government programs: considering the key role that commonsense knowledge can play in enhancing AI systems for private and public enterprises, Cyc’s strategy of steering the general knowledge base development in the direction of domain use cases can represent a sustainable business model for other stakeholders in the field.
Existing commonsense knowledge graphs make implicit categorizations of knowledge, by defining a tractable set of relations which can be traced to some types proposed in cognitive research. For instance, WebChild’s  part-of relations resemble the partonomic-meronymic relations in the cognitive science literature (e.g., see [11, 64]), while ConceptNet  defines 34 relations, where the relation IsA can often be approximated with taxonomic knowledge. In its first version , ConceptNet defined 20 relations grouped into 8 categories: K-lines, Things, Agents, Events, Spatial, Causal, Functional, and Affective. Zhang et al.  align the types in the Conceptual Semantic Theory with those in ConceptNet 1.0 and propose the following six categories: property, object, eventuality, spatial, quantity, and others.
Beyond the structural differences, commonsense knowledge graphs share the same foundational elements. Commonsense knowledge is generally split into declarative and procedural, where the former is contained in unconditional assertions, and the latter requires conditional assertions.
6.2 Consolidation efforts
In this paper, we analyzed individual knowledge graphs through suitable semantic dimensions, with the goal of providing insights on how consolidation of commonsense knowledge resources can be guided and, eventually, achieved. A natural extension of our work would be to evaluate ongoing efforts that adopt alternative methods of consolidation: accordingly, Framester , BabelNet , CSKG , and Predicate Matrix  constitute some of the most mature projects in this space. In Framester, several resources like WordNet, VerbNet, FrameNet, and BabelNet, are aligned using an OWL schema based on Description and Situations and Semiotics ontology design patterns.
7 Discussion and Roadmap
7.1 Summary of Findings
Commonsense knowledge sources use different levels of semantics, come in a variety of forms, and strive to capture diverse notions of common sense. After surveying 20 commonsense knowledge sources, we proposed that their relations can be grouped into 13 dimensions, namely: lexical, similarity, distinctness, taxonomic, part-whole, spatial, creation, utility, desire/goal, quality, comparative, temporal, and relational-other. Most relations can be unambiguously mapped to one of the dimensions.
We apply our dimensions to reorganize the knowledge in the Commonsense Knowledge Graph (CSKG) . Following our devised mapping of relations to dimensions, we add an additional column (relation;dimension) in CSKG, indicating the dimension of each edge. This allows us to make use of the consolidation of seven existing sources done by CSKG, and complement it with the dimensions in order to perform more abstract analysis of its knowledge types.
We designed and ran four experiments to analyze commonsense knowledge in CSKG through the lenses of these 13 dimensions.
In experiment 1, we investigated the coverage of the 13 dimensions in current sources. Some dimensions, like part-whole and similarity, are a subject of interest in most sources. Others, like comparative knowledge and knowledge on desires/goals, are rarely captured. Yet, the depth of knowledge on the less commonly represented relations is still high, as illustrated by the 244 thousand desire/goal edges in ATOMIC. Here we also observed that the breadth of focus varies notably across sources, as some (e.g., ConceptNet and Wikidata-CS) cover a wide range of relations, while others (e.g., ATOMIC or WordNet) have a narrower focus.
Experiment 2 posed the question of whether individual knowledge statements are redundant across sources. Our experiments with four sources indicated, with few exceptions, that only a tiny portion of all edges were shared between a pair of sources. This experiment points to a two-fold motivation for node resolution over commonsense sources. On the one hand, node resolution is needed to increase the quality of computing overlap beyond the current lexical comparison of nodes. On the other hand, as the sources have generally complementary goals, it is likely that even with a more semantic computation the overlap will remain low. Node resolution is, thus, essential to consolidate different views of a node (concept) into a single representation.
In experiment 3, we cluster all edges in CSKG according to their dimension, and compare these clusters with clusters based on a language model encoding of each edge. We noted that the overall agreement between the dimensions and the language model-based clustering is relatively low, indicating that language models pay much attention to the edge nodes. However, individual correspondences were noted. Similarity and distinctness quite clearly dominated some of the RoBERTa-based clusters, while other clusters were consistently split between the dimensions of desire/goal and temporal knowledge. Interestingly, the clusters inferred from the RoBERTa embeddings often clustered nodes from different sources into a single cluster.
Finally, in experiment 4 we investigated the impact of the dimensions on a downstream reasoning task of commonsense question answering. We adopted a recent idea  for pretraining language models with knowledge graphs. The best-scoring model in this paper, RoBERTa-large with marginal ranking loss, was fed with knowledge from one of our dimensions at a time, and evaluated in a zero-shot manner on two benchmarks testing broad (CommonsenseQA) and social (SocialIQA) commonsense reasoning. The experiments showed that social commonsense reasoning clearly benefits from temporal, quality, and desire/goal knowledge, whereas the CommonsenseQA benchmark benefits from broad knowledge from all dimensions. Certain dimensions, such as lexical knowledge, were relatively uninformative, as it can be expected that such knowledge has been already acquired by the language models at their initial training stage. While the extent of knowledge plays a role, adding more knowledge is not always beneficial, as the task performance depends on the alignment between the dimensions and the task. This motivates further work on automatic alignment between the task questions and our dimensions.
This also motivates future work that attempts to evaluate the value of additional content additions related to particular dimensions aimed at improving certain kinds of tasks.
The goal of consolidating and applying commonsense knowledge is an ambitious one, as witnessed by decades of research on this topic. Our 13 dimensions are an effort to reorganize existing commonsense knowledge through unification of its knowledge types. We see this as a necessary, but not sufficient, step towards a modern and comprehensive commonsense resource. The dimensions could facilitate, or complement, several other aspects of this pursuit:
1. Node resolution While potentially controversial, the consolidation of commonsense knowledge relations into dimensions/knowledge types is achievable with careful manual effort. This is largely due to the relatively small number of relations in most current sources. Resolving nodes across sources is another key aspect of this consolidation, strongly motivated by experiment 2 of this paper. Sources capture complementary knowledge, whose combination is prevented by the lack of mappings between their nodes. As nodes are currently intended to represent various aspects of meaning: words, phrases, concepts, frames, events, sentences — their consolidation is not obvious. Moreover, the number of unique nodes in most sources is on the order of many thousands or even millions , preventing it from being solvable by mere manual effort. Node resolution could be framed as a ‘static’ disambiguation/clustering task, where each statement for a node is to be classified into one of its meanings, similar to . Here, the set of meanings can be either present (e.g., WordNet synsets) or dynamically inferred from the data. An alternative, ‘dynamic’ approach is to defer the node resolution to task time and perform it implicitly as a task of retrieving evidence from a knowledge source . Another option is a combination of the static and the dynamic approaches.
2. Coverage and boundaries
At present, it is difficult to estimate the completeness of commonsense knowledge sources. With the relations organized into dimensions of knowledge, we gain insight into the volume of knowledge that falls within each of the dimensions. An ideal node resolution would take us one step further, allowing us to detect gaps, i.e., understand which relevant facts are not represented by any of the sources. If nodes are resolved to an ontology like WordNet, one could leverage its taxonomy to infer new information. For instance, ConceptNet is at present unable to infer that if barbecues are held in outdoor places, they could, by extension, be held in a park or someone’s patio. In addition, a more semantic resource would allow us to define constraints over the knowledge and detect anomalies and contradictory knowledge, which can be argued to define the boundaries of the knowledge that can be obtained. It is reasonable that such boundaries exist, as commonsense knowledge is characterized by the commonness of its concepts and a restricted set of relations .
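Taxonomy-based inference of the kind described can be sketched as follows, assuming nodes have been resolved against a small IsA hierarchy; the facts and labels here are illustrative:

```python
# Minimal sketch: if barbecues are held at outdoor places, and parks and
# patios are outdoor places, infer that barbecues can be held in both.
# The toy hierarchy and location facts are illustrative assumptions.
is_a = {"park": "outdoor_place", "patio": "outdoor_place"}
at_location = {("barbecue", "outdoor_place")}

def infer_locations(at_location, is_a):
    """Propagate AtLocation facts down the IsA hierarchy."""
    inferred = set()
    for head, location in at_location:
        for specific, general in is_a.items():
            if general == location:
                inferred.add((head, specific))
    return inferred

print(sorted(infer_locations(at_location, is_a)))
# [('barbecue', 'park'), ('barbecue', 'patio')]
```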
Further, organizing resources by dimensions also allows us to describe strengths (and/or weaknesses) of resources. For example, a resource that has many partonomic relationships might be the first resource to consider using if a task requires part-whole reasoning.
3. Generalizable downstream reasoning As current large-scale commonsense sources are primarily text-based, they are lexicalized prior to their combination with language models, losing much of their structure. As this lack of structure prevents us from understanding their coverage and gaps, we are unable to measure their potential for downstream reasoning as a function of the available knowledge. It remains unknown to which extent a more complete source, organized around dimensions of commonsense knowledge, would be able to contribute to improve performance. Experiment 4 showed that there is correspondence between knowledge dimensions and question answering tasks, motivating automatic alignment between the two. Moreover, a comprehensive semantic source may inspire new neuro-symbolic reasoning methods, with potentially enhanced generalizability and explainability, opening the door for reliable commonsense services to be made available in the future.
4. Evaluation and knowledge gaps Experiment 4 showed that the potential of different dimensions for reasoning varies greatly and is largely dependent on the evaluation data. This finding is in line with . The fact that certain dimensions consistently contribute little can be an indicator for gaps in current evaluation. Namely, dimensions like distinctness and spatial which currently contribute little or not at all are likely to be underrepresented in current evaluations. These gaps should ideally be addressed in the future by new benchmarks that will represent these missing dimensions. We note that our set of dimensions is based on the relations found in current popular commonsense sources. Hence, in this paper, we make an assumption that the knowledge types in these sources suffice, or at least have previously sufficed, and can express the desired knowledge. The diversity of knowledge expressed by the relational-other dimension, as pointed out also in , might be an indicator for additional, latent dimensions hidden behind the vagueness of this dimension.
At present, commonsense knowledge is dispersed across a variety of sources with different foci, strengths, and weaknesses. The complementary knowledge covered by these sources motivates efforts to consolidate them under a common representation. In this paper, we pursued the goal of organizing commonsense relations into a shared set of knowledge dimensions in a bottom-up fashion. Starting from a survey and analysis of the relations found in existing sources, we grouped them into 13 dimensions: lexical, similarity, distinctness, taxonomic, part-whole, spatial, creation, utility, desire/goal, quality, comparative, temporal, and relational-other. As each relation in these sources can be mapped to a dimension, we applied our method to abstract the relations in an existing consolidated resource: the Commonsense Knowledge Graph (CSKG). This allowed us to empirically study the impact of these dimensions. First, we observed that some dimensions are included more often than others, potentially pointing to gaps in the knowledge covered in existing resources. Second, we measured sparse overlap of facts expressed with each dimension across sources, which motivates future work on graph integration through (automated) node resolution. Third, comparing the dimension-based clustering to language model-based unsupervised edge clustering resulted in low overall agreement, though in some cases, the unsupervised clusters were dominated by one or two dimensions. This showed that some of the dimensions represent a stronger signal for language modeling than others. Fourth, we measured the impact of each dimension on a downstream question answering reasoning task, by adapting a state-of-the-art method of pretraining language models with knowledge graphs. Here, we observed that the impact differs greatly per dimension, depending largely on the alignment between the task and the knowledge dimension, as well as on the novelty of knowledge captured by a dimension.
While this is in accordance with the findings of the original method , the dimension-driven experiments in this paper allowed this hypothesis to be investigated much more precisely, revealing the direct impact of each knowledge dimension rather than of entire knowledge sources.
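The relation-to-dimension abstraction applied in these experiments can be sketched as a simple lookup over the edges of a knowledge graph. The mapping excerpt and edge list below are illustrative stand-ins, not the published mapping table:

```python
from collections import defaultdict

# Illustrative excerpt of a relation-to-dimension mapping; the actual
# mapping covers every relation in the surveyed sources.
RELATION_TO_DIMENSION = {
    "/r/IsA": "taxonomic",
    "/r/PartOf": "part-whole",
    "/r/AtLocation": "spatial",
    "/r/UsedFor": "utility",
    "/r/Desires": "desire/goal",
    "/r/Synonym": "similarity",
    "/r/Antonym": "distinctness",
}

def count_by_dimension(edges):
    """Abstract each (head, relation, tail) edge to its knowledge
    dimension and tally edges per dimension; relations outside the
    mapping fall back to the catch-all relational-other dimension."""
    counts = defaultdict(int)
    for _head, relation, _tail in edges:
        counts[RELATION_TO_DIMENSION.get(relation, "relational-other")] += 1
    return dict(counts)

edges = [
    ("/c/en/cat", "/r/IsA", "/c/en/animal"),
    ("/c/en/wheel", "/r/PartOf", "/c/en/car"),
    ("/c/en/knife", "/r/UsedFor", "/c/en/cut"),
    ("/c/en/knife", "/r/CapableOf", "/c/en/cut"),
]
print(count_by_dimension(edges))
# {'taxonomic': 1, 'part-whole': 1, 'utility': 1, 'relational-other': 1}
```

Per-dimension edge counts of this kind underlie the coverage and overlap statistics reported above; the full computation is in the linked CSKG script.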
Our experiments inspired a four-step roadmap towards the creation and utilization of a comprehensive dimension-centered resource. (1) Node resolution methods should be introduced and applied to unify the resources further. (2) Such an integration would allow us to better understand and improve the coverage, gaps, and boundaries of these sources. (3) A large-scale, public semantic graph of commonsense knowledge may inspire novel neuro-symbolic methods, potentially allowing for better generalization and explainability. (4) The impact of a dimension is an indicator of the coverage of that dimension in current evaluation benchmarks; under-represented dimensions are evaluation gaps that may need to be filled by introducing new benchmarks. Vice versa, additional knowledge dimensions might be hidden behind the generic relational-other dimension.
This material is based upon work sponsored by the DARPA MCS program under Contract No. N660011924033 with the United States Office of Naval Research.
- This phenomenon can be seen on the benchmark leaderboards, which are dominated by ‘pure’ language models, for instance: https://leaderboard.allenai.org/socialiqa/submissions/public (accessed on January 5th, 2021).
- For brevity, we omit the word ‘digital’ in the remainder of this paper.
- Here we exclude implicitly comparative knowledge, such as the inferred information that eating food makes one more satisfied from the triple: PersonX eats food – xReact – satisfied.
- As discussed before, this assumption might not always hold in practice. Future work should attempt to refine this mapping, e.g., by crowdsourcing or by clustering algorithms.
- We leave out the relations prefixed with /r/dbpedia from ConceptNet, as these are being deprecated according to the official documentation: https://github.com/commonsense/conceptnet5/wiki/Relations.
- Python script: https://github.com/usc-isi-i2/cskg/blob/master/consolidation/compute_dimensions.py.
- If a node has more than one label, then we perform comparison based on the first one.
- Notebook: https://github.com/usc-isi-i2/cskg/blob/master/analysis/Overlap.ipynb.
- Notebook: https://github.com/usc-isi-i2/cskg/blob/master/embeddings/Summary%20of%20Dimension%20on%20CSKG.ipynb
- We omit question types with less than 20 questions for CSQA.
- According to the classic argument of the mind-body problem, it is inherently impossible to characterize how generalization occurs, due to an explanatory gap .
- Because our focus is on resources, it is beyond the scope of this paper to discuss seminal investigations on common sense axiomatization, such as Pat Hayes’ naive physics  and Ernest Davis’ work on qualitative commonsense reasoning.
- The distinction between these types of assertions was formalized in a seminal work by Gentzen .
- These patterns can be accessed at http://ontologydesignpatterns.org/wiki/Main_Page
CreateSpace Independent Publishing Platform.
Cited by: §6.1.
The Berkeley FrameNet project.
In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1,
Cited by: §1,
Self-supervised knowledge triplet learning for zero-shot question answering.
arXiv preprint arXiv:2005.00316.
Cited by: §1.
ProtoQA: a question answering dataset for prototypical common-sense reasoning.
arXiv preprint arXiv:2005.00771.
Cited by: §1.
Psychology from an empirical standpoint.
Cited by: §6.1.
A multilingual predicate matrix.
In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16),
Cited by: §6.2.
Sweetening ontologies with DOLCE.
In International Conference on Knowledge Engineering and Knowledge Management,
Cited by: §3.2.
Investigations into logical deduction. Translation printed in M. Szabo, The Collected Papers of Gerhard Gentzen.
Cited by: footnote 14.
An ecological approach to perceptual learning and development.
Oxford University Press, USA.
Cited by: §6.1.
Logic and conversation.
In Speech acts,
Cited by: §1.
Schema.org: evolution of structured data on the web.
Communications of the ACM 59 (2), pp. 44–51.
Cited by: §3.2.
The naive physics manifesto.
Expert systems in the microelectronic age.
Cited by: footnote 13.
Idealism and the problem of knowledge and existence.
In Proceedings of the Aristotelian Society,
Vol. 5, pp. 136–178.
Cited by: §6.1.
Hunger for contextual knowledge and a road map to intelligent entity linking.
In International Conference on Language, Data and Knowledge,
Cited by: §6.1.
CYC: a large-scale investment in knowledge infrastructure.
Communications of the ACM 38 (11), pp. 33–38.
Cited by: §1.
Materialism and qualia: the explanatory gap.
Pacific philosophical quarterly 64 (4), pp. 354–361.
Cited by: footnote 12.
Birds have four legs?! NumerSense: probing numerical commonsense knowledge of pre-trained language models.
arXiv preprint arXiv:2005.00683.
Cited by: §1.
ConceptNet – a practical commonsense reasoning tool-kit.
BT technology journal 22 (4), pp. 211–226.
Cited by: §6.1.
Framing and interpretation.
Melbourne University Press.
Cited by: §6.1.
Programs with common sense.
RLE and MIT computation center.
Cited by: §1.
The cognitive revolution: a historical perspective.
Trends in cognitive sciences 7 (3), pp. 141–144.
Cited by: §6.1.
Towards a standard upper ontology.
In Proceedings of the international conference on Formal Ontology in Information Systems-Volume 2001,
Cited by: §3.2.
YAGO 4: a reason-able knowledge base.
The Semantic Web.
Cited by: §3.2.
Language models as knowledge bases?
arXiv preprint arXiv:1909.01066.
Cited by: §3.5.
Language models are unsupervised multitask learners.
OpenAI Blog 1 (8), pp. 9.
Cited by: §3.5.
Cited by: §3.3.
Commonsense properties from query logs and question answering forums.
In Proceedings of the 28th ACM International Conference on Information and Knowledge Management,
Cited by: §3.1.
Social IQa: commonsense reasoning about social interactions.
In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP),
Hong Kong, China, pp. 4463–4473.
Cited by: §1,
VerbNet: a broad-coverage, comprehensive verb lexicon.
Cited by: §3.3.
ESO: a frame-based ontology for events and implied situations.
Proceedings of MAPLEX 2015.
Cited by: §6.2.
Open Mind Common Sense: knowledge acquisition from the general public.
In OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”,
Cited by: §3.1.
CommonsenseQA: a question answering challenge targeting commonsense knowledge.
In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
Minneapolis, Minnesota, pp. 4149–4158.
Cited by: §1,
Stereotyping and bias in the flickr30k dataset.
In Proceedings of Multimodal Corpora: Computer vision and language processing (MMC 2016), J. Edlund, D. Heylen and P. Paggio (Eds.),
Cited by: §3.4.
Understanding stories with large-scale common sense.
Cited by: §1.
A taxonomy of part-whole relations.
Cognitive science 11 (4), pp. 417–444.
Cited by: §6.1.
Visual semantic navigation using scene priors.
arXiv preprint arXiv:1810.06543.
Cited by: §1.
WinoWhy: a deep diagnosis of essential commonsense knowledge for answering Winograd Schema Challenge.
arXiv preprint arXiv:2005.05763.
Cited by: §6.1.