
Commonsense knowledge is essential for many AI applications, including those in natural language processing, visual processing, and planning. Consequently, many sources that include commonsense knowledge have been designed and constructed over the past decades.
Recently, the focus has been on large text-based sources, which facilitate easier integration with neural (language) models and application on textual tasks, typically at the expense of the semantics of the sources. Such practice prevents the harmonization of these sources, understanding their coverage and gaps, and may hinder the semantic alignment of their knowledge with downstream tasks. Efforts to consolidate commonsense knowledge have yielded partial success, but provide no clear path towards a comprehensive consolidation of existing commonsense knowledge.

The ambition of this paper is to organize these sources around a common set of dimensions of commonsense knowledge.
For this purpose, we survey a wide range of popular commonsense sources with a special focus on their relations. We consolidate these relations into 13 knowledge dimensions, each abstracting over more specific relations found in the sources. This consolidation allows us to unify the separate sources and to compute indications of their coverage, overlap, and gaps with respect to the knowledge dimensions. Moreover, we analyze the impact of each dimension on downstream reasoning tasks that require commonsense knowledge, observing that the temporal and desire/goal dimensions are very beneficial for reasoning on current downstream tasks, while distinctness and lexical knowledge have little impact. These results reveal a focus on some dimensions in current evaluation, and a potential neglect of others.




Dimensions of Commonsense Knowledge

Filip Ilievski, Alessandro Oltramari, Kaixin Ma, Bin Zhang, Deborah L. McGuinness, Pedro Szekely

Keywords: commonsense knowledge; semantics; knowledge graphs; reasoning

1 Introduction

Commonsense knowledge is information that humans
typically have that helps them make sense of everyday situations.
As such, this knowledge can generally be assumed to be possessed by most people, and, according to the Gricean maxims [21], it is typically omitted in (written or oral) communication.
The fact that commonsense knowledge is often implicit presents a challenge for automated natural language processing (NLP) and question answering (QA) approaches, as extraction and learning algorithms cannot count on this knowledge being stated explicitly in text.

Due to its prominence and implicit nature, capturing commonsense knowledge holds a promise to benefit various AI applications, including those in NLP, computer vision, and planning. For instance, commonsense knowledge can be used to fill gaps and explain the predictions of a (neural) model [33], understand agent goals and causality in stories [63], or enhance robot navigation and manipulation [65].

Consequently, acquiring and representing commonsense knowledge in machine-readable form, as well as reasoning with it, has been a major pursuit of AI since its early days [40].
This has resulted in
the design, construction, and curation of a rich palette of
resources that include commonsense information (potentially along with other content)
like Cyc [30], ATOMIC [52], WebChild [60], ConceptNet [58], WordNet [41], FrameNet [2], and
Visual Genome [29]. Some of these, such as ConceptNet and Cyc, have been deliberately created to capture
information that would be useful for common sense-related reasoning tasks, while others, like WordNet or Visual Genome, were intended to support other tasks such as word sense disambiguation or image object recognition. As reported in [27], the commonsense sources exhibit large diversity in terms of their representation formats, creation methods, and coverage. While this reflects an opportunity for this knowledge to be exploited jointly, the inherent diversity makes the consolidation of these sources challenging.

Meanwhile, the last few years have featured a reinforced focus on benchmarks that evaluate different aspects of common sense, including social [53], physical [6], visual [66], and numeric [34] common sense. Further distinction has been made between discriminative tasks [53, 6, 59], where the goal is to pick the single correct answer from a list, and generative tasks, where one has to generate one or multiple correct answers [34, 7]. These tasks can be tackled by using the (entire or a subset of) training data [37, 33], or in a zero-/few-shot evaluation regime [38, 56].

The wealth and diversity of commonsense sources, on the one hand, and benchmarks, on the other, raises a natural question: what is the role of these knowledge repositories for real-world reasoning techniques that need to incorporate commonsense knowledge? While intuitively such sources of commonsense knowledge can have tremendous value for downstream reasoning tasks, in practice their impact on these tasks has been relatively limited, especially in comparison to the contribution of language models. Knowledge sources tend to have a larger contribution when little or no training data is available: for example, one can generate artificial training sets based on several sources, which can be used to pre-train language models and apply them to downstream tasks without using the official training data [3, 38].
The impact of knowledge resources so far has been generally conditioned on the special cases where the knowledge and the task are known (in advance) to be well-aligned [38, 37].
While a variety of sources [52, 41, 58] or their combination [27] have been used to enhance language models for downstream reasoning, little is known about how this alignment between knowledge types and tasks can be dynamically achieved.

Most recent sources have focused on the breadth of knowledge, sometimes at the expense of its semantics [4, 43]. Text-based representations are particularly attractive, as they facilitate a more direct integration with language models, as well as reasoning on NLP and QA tasks. These sources are often treated as ‘corpora’, where each fact is typically lexicalized (manually or automatically) into a single sentence [37], which is used to inform or fine-tune a language model.
Due to the lack of focus on formal representational principles, the sources capture knowledge types which are not trivial to align with other sources, as shown by the sparse mappings available between these sources [27].
Considering the
lack of a common vocabulary and/or lack of alignment of these sources, their limited coverage, and lack of focus on explicit semantics,
knowledge is typically kept in an impoverished textual form that is easy to capture and combine with language models. The downsides of this practice are: 1) commonsense knowledge across sources remains difficult to harmonize; 2) without a thorough harmonization or consolidation, it is not clear how to effectively measure coverage, overlap, or gaps; and 3) text-based representations may be unable to capture the richness of contextual reasoning typically done by humans.
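The lexicalization practice described above can be sketched minimally. The templates and helper below are illustrative assumptions, not the actual templates used by any of the cited works:

```python
# Hypothetical sketch: rendering KG triples as single natural-language
# sentences, as done when facts are used to inform or fine-tune a
# language model. The relation templates are illustrative only.
TEMPLATES = {
    "UsedFor": "{head} is used for {tail}.",
    "CapableOf": "{head} can {tail}.",
    "AtLocation": "You are likely to find {head} in {tail}.",
}

def lexicalize(head: str, relation: str, tail: str) -> str:
    """Render one (head, relation, tail) fact as a sentence."""
    template = TEMPLATES.get(relation, "{head} is related to {tail}.")
    return template.format(head=head, tail=tail)

print(lexicalize("food", "CapableOf", "go rotten"))
# → "food can go rotten."
```

A fallback template handles relations without a dedicated pattern, mirroring how automatic lexicalization degrades gracefully for rare relations.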

Efforts to consolidate commonsense knowledge across sources [27, 17, 44] have managed to bring these sources closer, which has shown impact on commonsense QA tasks [38]. In [26], we provide heuristics for defining the boundaries of commonsense knowledge, in order to extract such a subset from one of the largest available graphs today, Wikidata [62]. Yet, these efforts have had limited success, and many consolidation questions remain open. How should one think about commonsense knowledge in a theoretical way? What does it mean to build a consolidated knowledge graph (KG) of resources created largely in a bottom-up fashion? How should the relations be chosen? What is the right level of abstraction for relations and nodes?

2 Approach

The ambition of this paper is to provide insight into such questions, aiming primarily to organize the types of knowledge found in current sources of commonsense knowledge.

For this purpose, we survey a wide variety of sources of commonsense knowledge, ranging from commonsense KGs through lexical and visual sources, to the recent idea of using language models or corpora as commonsense knowledge bases. We survey their relations and group them into a set of dimensions, each being a cluster of its specific relations, as found in the sources. We then apply these dimensions to transform and unify existing sources, providing an enriched version of the Commonsense Knowledge Graph [27].
The dimensions allow us to perform four novel experiments:

  1. We assess the coverage of the sources with respect to each dimension, noting that some sources have wide (but potentially shallow) coverage of dimensions, whereas others have deep but narrow coverage. This supports the need to integrate these complementary sources into a single one.

  2. We benefit from the consolidation of the dimensions to compare the facts in the sources and compute metrics of overlap. The results show that there is little knowledge overlap across sources, even after consolidating the relations according to our dimensions, thus motivating future work on node resolution.

  3. We contrast the clusters according to our dimensions to language model-based clusters, to understand the similarities and differences in terms of their focus.

  4. We measure the impact of each dimension on two representative commonsense QA benchmarks. Following [38], we pre-train a language model and apply it on these benchmarks in a zero-shot fashion (without making use of the task training data). The dimensions provide a more direct alignment between commonsense knowledge and the tasks, revealing that some dimensions of knowledge are very helpful for a task, while others might even degrade model performance.
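The overlap measurement in experiment 2 can be sketched as follows; the relation-to-dimension map, the facts, and the Jaccard-style metric are toy assumptions, not the actual data or metric used in the paper:

```python
# Illustrative sketch: after mapping source-specific relations to shared
# dimensions, facts from two sources become directly comparable as sets.
REL2DIM = {"UsedFor": "utility", "use": "utility",
           "IsA": "taxonomic", "subclass of": "taxonomic"}

def consolidate(triples):
    """Rewrite each (head, relation, tail) with its dimension."""
    return {(h, REL2DIM.get(r, "relational-other"), t) for h, r, t in triples}

source_a = [("food", "UsedFor", "nourishment"), ("beverage", "IsA", "food")]
source_b = [("food", "use", "nourishment"), ("cake", "subclass of", "food")]

a, b = consolidate(source_a), consolidate(source_b)
overlap = len(a & b) / len(a | b)  # Jaccard-style overlap
print(overlap)  # 1 shared fact out of 3 distinct facts
```

Note that the two `food – nourishment` facts only match because `UsedFor` and `use` are first collapsed to the same dimension; without consolidation their overlap would be zero.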

The contributions of the paper are as follows. 1) We survey a wide variety of existing sources of commonsense knowledge, with an emphasis on their relations.
We provide a categorization of those resources and include a short overview of their focus and creation methods (Section 3).
2) We analyze the entire set of relations and abstract them to a set of 13 commonsense dimensions. Each dimension abstracts over more specific relations, as found in the sources (Section 4). 3) The identified dimensions are applied to consolidate the knowledge in the Commonsense Knowledge Graph (CSKG), which integrates seven of the sources we analyze in this paper. The resulting resource is made publicly available (Section 5).
4) We make use of this dimension-based consolidation of CSKG to
analyze the overlap, coverage, and knowledge gaps of individual knowledge sources in CSKG, motivating their consolidation into a single resource (Sections 5.1–5.3). 5) We evaluate the impact of different dimensions on
two popular downstream commonsense reasoning tasks. The results show that certain dimensions, like temporal knowledge and knowledge of desires/goals, are very beneficial and well-covered by benchmarks, whereas other dimensions, like distinctness and lexical knowledge, currently have little impact. These results reveal a more precise alignment between dimensions in the resources and existing tasks, and point to gaps in both existing knowledge sources and tasks (Section 5.4).
6) We reflect on the results of our analysis, and use it as basis to provide a roadmap towards building a more semantic resource that may further advance the representation of, and reasoning with, commonsense knowledge. Such a resource would be instrumental in building a general commonsense service in the future (Section 7).

3 Sources of Commonsense Knowledge

We define a digital commonsense knowledge source as a potentially multi-modal repository from which commonsense knowledge can be extracted. Commonsense knowledge sources come in various forms and cover different types of knowledge. While only a handful of sources have been formally proposed as commonsense sources, many others cover aspects of common sense. Here, we collect a representative set of sources, which have been either proposed as, or considered as, repositories of commonsense knowledge in the past. We categorize them into five groups, and describe the content and creation method of representative sources within each group.
Table 1 contains statistics and examples for each source.

Category | Source | Relations | Example 1 | Example 2
Commonsense KGs | ConceptNet* | 34 | food – capable of – go rotten | eating – is used for – nourishment
Commonsense KGs | ATOMIC | 9 | PersonX bakes bread – xEffect – eat food | PersonX is eating dinner – xEffect – satisfies hunger
Commonsense KGs | GLUCOSE | 10 | makes (that is food) Causes/Enables |
Commonsense KGs | WebChild | 4 (groups) | restaurant food – quality#n#1 – expensive | eating – type of – consumption
Commonsense KGs | Quasimodo | 78,636 | pressure cooker – cook faster – food | herbivore – eat – plants
Commonsense KGs | SenticNet | 4 | cold_food – polarity – negative | eating breakfast – polarity – positive
Commonsense KGs | HasPartKB | 1 | dairy food – has part – vitamin | n/a
Common KGs | Wikidata | 6.7k | food – has quality – mouthfeel | eating – subclass of – ingestion
Common KGs | YAGO4 | 116 | banana chip – rdf:type – food | eating – rdfs:label – feeding
Common KGs | DOLCE* | 1 | n/a | n/a
Common KGs | SUMO* | 1,614 | food – hyponym – food_product | process – subsumes – eating
Lexical resources | WordNet | 10 | food – hyponym – comfort food | eating – part-meronym – chewing
Lexical resources | Roget | 2 | dish – synonym – food | eating – synonym – feeding
Lexical resources | FrameNet | 8 (f2f) | Cooking_creation – has frame element – Produced_food | eating – evoke – Ingestion
Lexical resources | MetaNet | 14 (f2f) | Food – has role – food_consumer | consuming_resources – is – eating
Lexical resources | VerbNet | 36 (roles) | feed.v.01 – Arg1-PPT – food | eating – hasPatient – comestible
Visual sources | Visual Genome | 42,374 | food – on – plate | boy – is eating – treat
Visual sources | Flickr30k | 1 | a food buffet – corefers with – a food counter | a eating place – corefers with – their kitchen
Corpora & LMs | GenericsKB | n/a | Aardvarks search for food. | Animals receive nitrogen by eating plants.
Corpora & LMs | GPT-2 | n/a | Food causes a person to be hungry and a person to eat. | Eating at home will not lead to weight gain.

Table 1: Overview of commonsense knowledge sources. The asterisk (‘*’) indicates that the source is extended with WordNet knowledge. For FrameNet and MetaNet, we specify their numbers of frame-to-frame relations. WebChild contains a large number of relations, expressed as WordNet synsets, which are aggregated into 4 groups.

3.1 Commonsense Knowledge Graphs

ConceptNet [58] is a multilingual commonsense knowledge graph. Its nodes are primarily lexical and connect to each other with 34 relations. Its data is largely derived from the crowdsourced Open Mind Common Sense (OMCS) corpus [57], and complemented with knowledge from other resources, like WordNet.

ATOMIC [52] is a commonsense knowledge graph that expresses pre- and post-states for events and their participants in a lexical form with nine relations. Its base events are collected from a variety of corpora, while the data for the events is collected by crowdsourcing.

GLUCOSE [43] contains causal knowledge through 10 relations about events, states, motivations, and emotions. The knowledge in GLUCOSE is crowdsourced based on semi-automatic templates, and generalized from individual stories to more abstract rules.

WebChild [60] is a commonsense knowledge graph whose nodes and relations are disambiguated as WordNet senses. It captures 20 main relations, grouped in four categories. WebChild has been extracted automatically from Web information, and canonicalized in a post-processing step.

Quasimodo [51] contains commonsense knowledge about object properties, human behavior, and general concepts. Its nodes and relations are initially lexical and extracted automatically from search logs and forums, after which a notable subset of them has been clustered into WordNet domains.

SenticNet [10] is a knowledge base with conceptual and affective knowledge, which is extracted from text and aggregated automatically into higher-level primitives.

HasPartKB [5] is a knowledge graph of hasPart statements, extracted from a corpus with sentences and refined by automatic means.

3.2 Common Knowledge Graphs and Ontologies

Wikidata [62] is a general-domain knowledge graph, tightly coupled with Wikipedia, that describes notable entities. Its nodes and relations are disambiguated as Qnodes. The content of Wikidata is collaboratively created by humans, as well as other existing sources. Given the vast number of statements in Wikidata and its sizable set of over 7 thousand relations, we consider its Wikidata-CS commonsense subset, as extracted in [26].

YAGO [46] is a general-purpose knowledge graph, whose nodes and relations are disambiguated entities. The knowledge in YAGO is extracted automatically from Wikipedia, and consolidated with knowledge from other sources, like Schema.org [22].

DOLCE (Descriptive Ontology for Linguistic and Cognitive Engineering) [18] is an upper level ontology that captures the ontological categories underlying natural language and human common sense with disambiguated concepts and relations. It has been created manually by experts.

SUMO (Suggested Upper Merged Ontology) [45] is an ontology of upper-level disambiguated concepts and their relations. It has been created manually by experts.

3.3 Lexical Resources

WordNet [41] is a lexical database of words, their meanings, and taxonomical organization, in over 200 languages. It has been created manually by experts.

Roget [50] is a manually-created thesaurus that contains synonyms and antonyms for English words.

FrameNet [2] is a lexical resource that formalizes the frame semantics theory: meanings are mostly understood within a frame of an event and its participants that fulfill roles in that frame. FrameNet was created manually by experts.

MetaNet [15] is a repository of conceptual frames, as well as their relations which often express metaphors. It has been created manually.

VerbNet [54] is a resource that describes syntactic and semantic patterns of verbs, and organizes them into verb classes. It has been created manually by experts.

3.4 Visual Commonsense Sources

Visual Genome [29] contains annotations of concepts and their relations in a collection of images. The image descriptions are manually written by crowd workers, while their concepts are mapped automatically to WordNet senses and revised by crowd workers.

Flickr30k [48] contains annotations of objects in 30k images, produced by multiple workers. The expressions used by different annotators are clustered automatically into groups of coreferential expressions in [61].

3.5 Corpora and Language Models

GenericsKB [4] contains self-contained generic facts represented as naturally occurring sentences. The sentences have been extracted from three existing corpora, filtered by handwritten rules, and scored with a BERT-based classifier.

Language models, like RoBERTa [36] and GPT-2 [49], can be used as knowledge bases [47] to complement explicitly stated information, e.g., as a link prediction system like COMET [8] or through self-talk [56].

3.6 Observations

As apparent in this section, the commonsense sources are based on a wide range of representation principles and have been created with different construction methods. Through the example scenarios of food and eating (Table 1), we show that they have notable overlap in terms of their covered (typically well-known) concepts. At the same time, the types of knowledge covered differ across sources: some sources provide truisms, such as feeding being done with food, while others speculate on usual properties of food, such as its capability to go rotten or its tendency to be on a plate. Furthermore, we observe that the same or similar relations tend to have different names across sources (compare type of to subclass of or is; or has quality in Wikidata to cook faster in Quasimodo).

These distinctions make the integration of these sources, and the understanding of their coverage and gaps, very challenging. In order to integrate the knowledge in these sources, we next propose a consolidation of their relations into a common set of dimensions.
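As a minimal illustration of what such a consolidation enables: once each source-specific relation is mapped to a shared dimension, the per-dimension coverage of a source reduces to counting. The mapping excerpt and facts below are illustrative toy examples:

```python
from collections import Counter

# Toy excerpt of a relation-to-dimension mapping; a full mapping would
# cover every relation in every surveyed source.
DIMENSION_OF = {
    "Synonym": "similarity", "Antonym": "distinctness",
    "IsA": "taxonomic", "PartOf": "part-whole",
    "AtLocation": "spatial", "UsedFor": "utility",
    "CapableOf": "utility", "HasProperty": "quality",
    "Causes": "temporal", "RelatedTo": "relational-other",
}

def dimension_coverage(triples):
    """Count how many facts a source contributes to each dimension."""
    return Counter(DIMENSION_OF.get(rel, "relational-other")
                   for _, rel, _ in triples)

facts = [("food", "CapableOf", "go rotten"),
         ("eating", "UsedFor", "nourishment"),
         ("beverage", "IsA", "food")]
print(dimension_coverage(facts))
# Counter({'utility': 2, 'taxonomic': 1})
```

Running this per source yields the kind of coverage profile discussed in the experiments: wide-but-shallow sources spread counts across many dimensions, while deep-but-narrow ones concentrate them in a few.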

Dimension | ATOMIC | ConceptNet | WebChild | Other | Wikidata
lexical | | DerivedFrom | | lexical_unit (FN) | label
 | | EtymologicallyDerivedFrom | | lemma (WN) |
similarity | | Synonym | | reframing_mapping (FN) |
 | | SimilarTo | hassimilar | metaphor (FN) |
 | | DefinedAs | | Synonym (RG) | said to be the same as
 | | | | synonym (WN) |
distinctness | | Antonym | | Antonym (RG) | different from
 | | DistinctFrom | | antonym (WN) | opposite of
 | | | | excludes (FN) |
taxonomic | | IsA | | perspective_on (FN) | subClassOf
 | | InstanceOf | hasHypernymy | inheritance (FN) | instanceOf
 | | MannerOf | | hypernym (WN) | description
part-whole | | PartOf | | HasPart (HP) |
 | | HasA | physicalPartOf | meronym (WN) | has part
 | | MadeOf | memberOf | holonym (WN) | member of
 | | AtLocation* | substanceOf | | material used
spatial | | AtLocation* | location | | location
 | | LocatedNear | spatial | | anatomical location
creation | | CreatedBy | | | creator
utility | | UsedFor | hassynsetmember | using (FN) | used by
 | | CapableOf | activity | | use
 | | NotCapableOf | participant | | uses
desire/goal | xIntent | CausesDesire | | |
 | xWant | MotivatedByGoal | | |
 | oWant | Desires | | |
quality | | HasProperty | color | frame_element (FN) |
 | | NotHasProperty | taste_property | | color
 | xAttr | SymbolOf | temperature | | has quality
comparative | | | 6.3k relations | |
temporal | xNeed | HasFirstSubevent | | subframe (FN) |
 | xEffect | HasLastSubevent | time | precedes (FN) |
 | oEffect | HasSubevent | emotion | inchoative_of (FN) |
 | xReact | HasPrerequisite | prev | causative_of (FN) | has cause
 | oReact | Causes | next | | has effect
relational-other | | RelatedTo | | | field of this occupation
 | | HasContext | thing | see_also (FN) | depicts
 | | EtymologicallyRelatedTo | agent | requires (FN) | health specialty
Table 2: Knowledge Dimensions. Relations marked with ‘*’ describe knowledge in multiple dimensions. Relations whose names contain a negation (e.g., NotCapableOf, NotHasProperty) express negated statements.

4 Dimensions of commonsense knowledge

In the previous section, we surveyed 20 representative commonsense sources from five categories: commonsense KGs, common KGs, lexical, visual sources, and corpora and language models. A key contribution of this paper is a manual categorization (by the authors) of the kind of knowledge expressed by the relations in these sources into 13 dimensions.
Table 2 shows the correspondence of each relation in these analyzed sources to our dimensions. An example for each of the dimensions from different sources is shown in Table 3. We next describe each dimension in turn.

Lexical. Many data sources leverage the vocabulary of a language, or the lexicon, in their relations. This includes relationships such as plural forms of nouns or past tenses of verbs. Lexical knowledge also covers substring information: ConceptNet, for example, includes a relationship called DerivedFrom, which captures cases where a word or phrase appears within another term and contributes to that term’s meaning. Lexical knowledge also formalizes the relation between a concept and its expression in a language, e.g., through the label relation in Wikidata.

Similarity. Most data sources include the notion of synonymy between expressions, allow definitions of terms, or cover a broader notion of general similarity.
ConceptNet has all three subcategories – for instance, regarding similarity, it establishes that wholesome and organic food are similar notions, while eating is defined as process of taking in food. WebChild also captures similarity between WordNet concepts, while WordNet, Wikidata, and Roget focus on synonymy. For instance, Roget declares that food and edibles are synonyms, while Wikidata expresses that food is said to be the same as nutriment.

Distinctness. Complementary to similarity, most data sources have notions of some kind of distinguishability. Most commonly, this is formalized as antonymy, where words have an opposition relationship between them, i.e., they have an inherently incompatible relationship. For example, both Roget and ConceptNet consider hot and cold to be antonyms, as these are two exclusive temperature states of objects. FrameNet defines an Excludes relation to indicate that two roles of a frame cannot be simultaneously filled in a given situation. For instance, in the Placing frame, an event can either be brought by a cause event or by an intentional agent, but not both. Weaker forms of distinctness are defined by Wikidata and ConceptNet, for concepts that might be mistaken as synonyms. For example, Wikidata states that food safety is different from food security, while ConceptNet distinguishes food from drinks.

Dimension | Example | Source
lexical | derivationally related form: nutrient | WordNet
lexical | etymologically related: fodder | ConceptNet
lexical | derived term: foodie | ConceptNet
similarity | synonym: dish | Roget
similarity | said to be the same as: nutriment | Wikidata
similarity | similar to: wholesome – organic | ConceptNet
distinctness | opposite of: non-food item | Wikidata
distinctness | distinct from: drink | ConceptNet
distinctness | different from: food safety – food security | Wikidata
taxonomic | hyponym: comfort food | WordNet
taxonomic | hyponym: beverage | WordNet
taxonomic | hypernym: substance | WordNet
taxonomic | subclass of: disposable product | Wikidata
part-whole | things with food: minibar | ConceptNet
part-whole | is part of: life | COMET
part-whole | material used: food ingredient | Wikidata
spatial | is located at: pantry | ConceptNet
spatial | is located at: a store | ConceptNet
spatial | location: toaster – kitchen | Wikidata
spatial | located near: plate | Visual Genome
spatial | located near: table | Visual Genome
creation | is created by: cook | COMET
creation | is created by: plant | COMET
utility | use: eating | Wikidata
utility | used by: organism | Wikidata
utility | used for: pleasure | ConceptNet
utility | used for: sustain life | COMET
utility | used for: nourishment | ConceptNet
utility | capable of: cost money | ConceptNet
utility | capable of: go rotten | ConceptNet
utility | is capable of: taste good | COMET
desire/goal | xWant: watch movie together – get some food | ATOMIC
desire/goal | desires: regular access to food | ConceptNet
desire/goal | not desires: food poisoning | ConceptNet
desire/goal | causes desire to: eat | ConceptNet
desire/goal | xIntent: eats food – quit feeling hungry | ATOMIC
desire/goal | motivated by: cook a meal | ConceptNet
desire/goal | is motivated by: you be hungry | COMET
quality | xAttr: makes food – creative | ATOMIC
quality | has quality: shelf life | Wikidata
quality | has the property: tasty | COMET
comparative | healthier: home cooking – fast food | WebChild
temporal | has first subevent: cooking | ConceptNet
temporal | starts with: open your mouth | COMET
temporal | has effect: food allergy | Wikidata
temporal | causes: you get full | COMET
temporal | causes: indigestion | COMET
relational-other | related to: refrigerator | ConceptNet
relational-other | related to: cereal | ConceptNet
relational-other | field of work: food bank – food assistance | Wikidata
relational-other | main subject: cuisine – food product | Wikidata
Table 3: Examples for food for each of the 13 dimensions. When the subject is different from food, we state it explicitly, e.g., xWant: watch movie together – get some food.

Taxonomic. Most data sources include some kind of classification arrangement, where objects are placed into more general and more specific groupings with inheritance relations. When those groupings are ordered categories based on generality, this captures the notion of hyponymy, indicating a subcategory relationship. Hyponymy blends the distinction between the relationships subclass / IsA (intended for two classes)
and InstanceOf (intended as a relation between an instance and a class). For instance, Wikidata states that a sandwich wrap is street food, or that food is a disposable product. WordNet has information that beverage and comfort food are hyponyms of food. While this dimension generally focuses on concepts (nouns), it also includes a specialization relation for verbs. Here, the MannerOf relation in ConceptNet states that wheezing is a manner of breathing.

Part-whole. Many data sources include a notion of being a part of or a member of something. Part-whole knowledge can be transitive, such as that of geographic containment, exemplified by New York City being a part of New York State, which is in turn part of the United States. Other part-of notions, such as member-of, are not necessarily transitive. A third category of part-whole knowledge is expressed with the material or the building blocks of an object, such as food being made of food ingredients. A useful distinction between these three notions of part-whole: physical part of (sunroof – car), member of (musician – duet), and substance of (steel – boiler), is provided by WebChild. In addition, the importance of this commonsense dimension is shown by HasPartKB [5], an entire resource dedicated to part-whole relations.
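The transitivity noted above can be sketched as a small closure computation; the helper and the data below are illustrative, and member-of pairs would deliberately be left out of such chaining:

```python
# Toy sketch: chaining physical/geographic part-of pairs yields
# implied (part, whole) pairs via transitivity.
def transitive_closure(pairs):
    """Return all (part, whole) pairs implied by chaining part-of links."""
    closure = set(pairs)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure

part_of = {("New York City", "New York State"),
           ("New York State", "United States")}
closed = transitive_closure(part_of)
# closed now also contains ("New York City", "United States")
```

The same routine applied to member-of pairs would produce unwanted inferences (a musician is not a member of a concert series), which is why the distinction between the three part-whole notions matters.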

Spatial. Spatial relations describe terms relating to, or occupying, space. This may entail indicating a usual location of a concept, as in the location property in Wikidata or the AtLocation relation in ConceptNet. ConceptNet expresses locations for geographic entities, for example Boston is at location Massachusetts, as well as for things that can contain things: butter is at location refrigerator. Similarly to the latter case, Wikidata includes an example that toasters are located in kitchens. A weaker spatial relation is one of spatial proximity in WebChild or ConceptNet, specifying that, e.g., bikes are located near roads. While Visual Genome does not explicitly have a spatial relation, concepts occurring in the same image region can be represented with the LocatedNear relation [27]. Examples of such statements include food being located near a plate or a table.

Creation. This dimension describes the process or the agent that brought something into existence. ConceptNet gives an example that a cake is created by the bake process, COMET has information that food is created from plants, while Wikidata states that rifle factories create shotguns. Table 2 reveals that no other source has creation information.

Utility. This dimension covers a notion of fitness or usefulness of objects for some purpose.
ConceptNet’s relation UsedFor expresses knowledge that ‘the purpose of A is B’, with an example of food being used for pleasure or nourishment.
Wikidata has several similar relations: use, used by, and uses, which can express that platter is used for food presentation, or food is used by organisms.
ConceptNet includes the notion of CapableOf, described as ‘A is capable of B if A can typically do B’, like food being capable of going rotten, or knives being capable of cutting. Another related notion is that of receiving an action: a button may receive the push action. While a button does not have the sole purpose of being pushed, it is capable of receiving that action, and by inference, it may respond to the action.

Desire or goal. This dimension covers knowledge about agent desires or goals. An agent may want to have something or wish for something to happen. The agent typically has certain goals, aims, and/or plans, that may motivate or explain those desires.
The relation Desires in ConceptNet may indicate, e.g., that a person desires regular access to food. Its negated version, NotDesires expresses that a person does not desire poisoned food. ATOMIC has two relations: xWant and oWant, to indicate the desires of an agent or other agents in a given situation. For instance, when people watch a movie together, they want to get some food.
Regarding goals, ConceptNet includes the MotivatedByGoal and ObstructedBy relations to indicate the motivation and the constraint for a certain action. For instance, ConceptNet indicates that one’s sleep is obstructed by noise, while COMET’s extension of ConceptNet posits that people cook a meal because they are hungry.

Quality. Commonsense sources typically describe attributes of an agent or qualities related to an object.
For example, ConceptNet and COMET include the relation HasProperty to express knowledge like ice having the property cold and food having the property tasty.
ATOMIC uses xAttr to indicate that, for example, the person that cooks food often has the attribute hungry or creative. WebChild and Wikidata both provide more specific qualities, such as taste, temperature, shape, or color. For instance, WebChild would specify the plant color as green.

Comparative. WebChild performs comparison of objects based on relative values for their attributes. Example comparative relations in WebChild are: healthier than (home cooking – fast food), faster than (car – bike), and larger than (lion – hyena). Notably, no other source describes comparative knowledge explicitly.3

Temporal. Most sources have notions of time that may support ordering by time and/or may capture relations indicating that one thing is a prerequisite for another, or that one thing has a particular effect.
ConceptNet, for example, expresses that the first event of eating may be cooking, while the last one could be getting rid of the containers. COMET states that eating starts with opening one’s mouth. More strongly, the temporal relations often indicate relative ordering of two events, through relations of causation and effects, such as food potentially causing allergy or indigestion. Such causal knowledge is found in ATOMIC, ConceptNet, COMET, WebChild, and Wikidata.

Relational-other. Conceptual and context-related relationships are often underspecified. On the one hand, some sources increasingly capture descriptions of the circumstances that form the setting for a statement, event, or idea. ConceptNet has a single relation HasContext for this, while Wikidata has more concrete contextual relations, such as field of this occupation, depicts, and health specialty. This allows Wikidata to express that the main subject of a cuisine is a food product, and that the field of work of food banks is food assistance.
On the other hand, most of the knowledge in ConceptNet belongs to a generic relation called RelatedTo that may be used to capture a relatively vague semantic connection between two concepts, such as food being related to refrigerator or cereal.

Our organization of existing relations into 13 dimensions provides a unified framework to reorganize and consolidate these sources. Here, we discuss two nuances of our process. First, we placed the negative statements (marked with in Table 2) in the same dimension as the positive ones, as they cover the same knowledge type, despite having a different polarity and, arguably, purpose. Following a similar line of reasoning, we also placed inverse relations, such as used for and uses, in the same dimension. Second, we recognize that the underlying data may not always be clearly placed in one of these dimensions. For instance, the relation AtLocation, which intuitively should belong to the spatial category, contains some statements that express part-whole knowledge.

5 Experiments

Seven of the sources covered in the previous section: ConceptNet, ATOMIC, Visual Genome, WordNet, Roget, Wikidata-CS, and FrameNet, have been integrated together in the Commonsense Knowledge Graph (CSKG) [27]. We start with CSKG and apply our dimension classification (section 4) to its sources, under the assumption that each of their edge relations can be mapped unambiguously to one of the dimensions.4 As a result, each edge in CSKG has dimension information stored in its relation;dimension column.5 CSKG contains knowledge for 12 out of our 13 dimensions – the dimension comparative is not represented, as its only source, WebChild, is currently not part of CSKG. The resulting file is publicly available at: http://shorturl.at/msEY5.
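At its core, the enrichment step is a lookup from each edge's relation to its dimension. The sketch below uses a small illustrative subset of the relation-to-dimension mapping (not the full inventory) and a simplified edge format:

```python
# Illustrative subset of the relation-to-dimension mapping; the full mapping
# covers all relations of the CSKG sources.
RELATION_TO_DIMENSION = {
    "/r/Synonym": "similarity",
    "/r/SimilarTo": "similarity",
    "/r/Antonym": "distinctness",
    "/r/IsA": "taxonomic",
    "/r/PartOf": "part-whole",
    "/r/AtLocation": "spatial",
    "/r/UsedFor": "utility",
    "/r/Desires": "desire/goal",
    "/r/HasProperty": "quality",
    "/r/HasSubevent": "temporal",
    "/r/RelatedTo": "relational-other",
}

def enrich_edges(edges):
    """Attach a dimension to each (head, relation, tail) edge."""
    return [
        (h, r, t, RELATION_TO_DIMENSION.get(r, "relational-other"))
        for h, r, t in edges
    ]

edges = [
    ("/c/en/food", "/r/HasProperty", "/c/en/tasty"),
    ("/c/en/food", "/r/Antonym", "/c/en/drink"),
]
enriched = enrich_edges(edges)
# each edge now carries its dimension, e.g., "quality" and "distinctness"
```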

This enrichment of the CSKG graph allows us to study the commonsense knowledge dimensions from multiple novel perspectives.
We investigate the following questions:

  1. How well is each dimension covered in the current sources? Here we compute the number of edges for each dimension across sources.

  2. Is knowledge redundant across sources?
    In experiment 2, we use the dimensions to quantify overlap between sources with respect to individual edges.

  3. How do the dimensions of the edges compare to their language model (LM) encodings?
    Experiment 3 computes clusters based on our dimensions and compares them to clusters computed with Transformer-based language models, like BERT [14] and RoBERTa [36].

  4. What is the impact of each dimension for reasoning on QA tasks?
    Each of the dimensions is used to select a subset of the available knowledge in CSKG. The selected knowledge is then used to pretrain a RoBERTa language model, which is applied to answer commonsense questions in a zero-shot manner.

In this section, we formulate and run suitable studies for each of the four questions, and reflect on the results.

Dimension ATOMIC ConceptNet WebChild ROGET Wikidata-CS WordNet FrameNet
lexical 704 0.5 207 14
similarity 255 343 1,023 1 152 0.4
distinctness 22 381 7 4
taxonomic 244 783 73 89 23
part-whole 19 5,752 8 22
spatial 28 660 0.5
creation 0.3 0.2
utility 69 2,843 2 1
desire/goal 244 20
quality 143 9 6,510 1 11
comparative 813
temporal 346 71 2,135 3 0.6
relational-other 1,969 291 6 0.7
Table 4: Coverage of sources in terms of the knowledge dimensions. The numbers presented are in thousands.

5.1 Experiment 1: How well is each dimension covered in the current sources?

We use the CSKG graph enriched with edge dimensions to compute source coverage with respect to each dimension. The coverage of each source, formalized as a count of the number of edges per dimension, is presented in Table 4.6
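The coverage computation itself reduces to an edge count per (source, dimension) pair; a minimal sketch over toy edges (the real computation runs over all enriched CSKG edges):

```python
from collections import Counter

# Toy (source, dimension) pairs, one per edge; in practice these come from
# the dimension-enriched CSKG edge file.
edges = [
    ("ConceptNet", "similarity"),
    ("ConceptNet", "taxonomic"),
    ("ConceptNet", "taxonomic"),
    ("Roget", "similarity"),
    ("Roget", "distinctness"),
]
coverage = Counter(edges)
# coverage[("ConceptNet", "taxonomic")] counts taxonomic edges in ConceptNet
```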

We observe several trends in this table. First, there is much imbalance in the number of sources per dimension. Comparative knowledge and creation information are very rare and are described by only one or two sources, whereas taxonomic, temporal, and similarity knowledge are much more common and are captured by most sources. Second, some of the dimensions, like creation or part-whole, are represented with relatively few edges, whereas similarity and taxonomic knowledge generally have a much larger number of edges. The exception for the former is the large number of part-whole statements in WebChild, which is due to WebChild being automatically extracted, resulting in many duplicates and noisy information. Third, we see that some sources, like ConceptNet, FrameNet, and Wikidata-CS, aim for breadth and cover most dimensions. Others, like Roget and ATOMIC, have a narrow focus on specific dimensions: primarily desires/goals and temporal knowledge in ATOMIC, and only knowledge on similarity and distinctness in Roget. Yet, the narrow focus generally coincides with much depth, as both sources have many edges for the small set of dimensions that they cover. FrameNet, having a broad focus, has a small number of edges for each dimension due to its limited coverage of lexical units. Again, WebChild is a notable outlier with a large number of automatically extracted statements for most dimensions. Finally, we observe different ratios between ‘strong’ and ‘weak’ semantic relations across sources. Most of ConceptNet’s knowledge falls under the generic relational-other category, whereas only a small portion of Wikidata-CS belongs to the same dimension. Most of Wikidata-CS is taxonomic knowledge.

Source pair Relations Dimensions
CN – RG 57,635 (1.23%) 73,992 (1.60%)
CN – WD 2,386 (0.07%) 2,623 (0.08%)
CN – WN 86,006 (2.14%) 97,946 (2.60%)
RG – WD 299 (0.02%) 333 (0.02%)
RG – WN 75,025 (3.55%) 75,025 (3.93%)
WD – WN 1,697 (0.19%) 1,704 (0.25%)
Table 5: Overlap between various source pairs, based on the original relations (Relations) or the abstracted dimensions (Dimensions). Absolute overlap numbers are accompanied in brackets by the Jaccard percentage of the overlap against the union of all triples in the two sources.
Sources part-whole taxonomic lexical distinctness similarity quality utility creation temporal rel-other
CN-RG 4,639 69,353
(1.17) (5.79)
CN-WD 68 1,888 20 266 102 0 14 0 1 264
(0.25) (0.62) (0.00) (1.00) (0.04) (0.00) (0.02) (0.00) (0.00) (0.01)
CN-WN 4,710 73,123 1,053 19,060
(4.10) (15.19) (4.65) (5.05)
RG-WD 206 127
(0.05) (0.01)
RG-WN 3,300 71,725
(0.87) (6.50)
WD-WN 82 1,533 63 26
(0.07) (0.39) (0.62) (0.02)
Table 6: Overlap distribution across dimensions. Absolute overlap numbers are accompanied in brackets by the Jaccard percentage of the overlap against the union of all triples in the two sources. '-' indicates that at least one of the sources does not use the dimension.

5.2 Experiment 2: Is knowledge redundant across sources?

Our analysis so far reveals that most dimensions are covered by more than one source. This leads us to the next question: how often is a statement found in multiple sources?

Computing edge overlap between sources is conditioned on identity mapping between their nodes and relations.
While CSKG provides such identity mappings between some of its nodes, this cannot be expected to be complete. We align the edges as follows.
The nodes across sources are naively compared through their labels.7 Regarding the relations, we benefit from the CSKG principle of normalizing the relations across sources to a shared set. With this procedure, a WordNet edge (food.n.01, synonym, dish.n.01) is modelled as (food, /r/Synonym, dish) in CSKG.
As a dimension-based enhancement, we abstract each relation further by mapping it to our dimensions, e.g., transforming (food, /r/Synonym, dish) to (food, similarity, dish). This dimension-based transformation allows for more flexible matching within a dimension, for instance, enabling similarity and synonymy statements to be compared for equivalence, since both (food, /r/Synonym, dish) and (food, /r/SimilarTo, dish) would be normalized to (food, similarity, dish).

We apply the relation-based and dimension-based variants to compute overlap between four sources: ConceptNet, Roget, Wikidata, and WordNet, in terms of each dimension. Here we do not consider ATOMIC or FrameNet, as their edges can be expected to have extremely low lexical overlap with the other sources. The overlap is computed as a Jaccard score between the number of shared triples between two sources and the union of their triples. The obtained scores are given in Table 5.8 We observe that the overlap is generally low, yet, translating the original relations into dimensions consistently leads to an increase of the overlap for all source pairs. The highest relative overlap is between Roget and WordNet (3.93%), with ConceptNet-WordNet coming second (2.60%). The lowest overlap is obtained between the sources Roget and Wikidata (0.02%).
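The two overlap variants can be illustrated on toy triples; the relation-to-dimension mapping below is a small assumed subset. The dimension-based variant replaces each relation with its dimension before comparing, which lets a /r/Synonym edge match a /r/SimilarTo edge:

```python
# Small assumed subset of the relation-to-dimension mapping.
DIM = {"/r/Synonym": "similarity", "/r/SimilarTo": "similarity"}

def jaccard(a, b):
    """Jaccard score between two sets of triples."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

source1 = {("food", "/r/Synonym", "dish")}
source2 = {("food", "/r/SimilarTo", "dish")}

# Relation-based: the relations differ, so the triples do not match.
relation_overlap = jaccard(source1, source2)

# Dimension-based: both relations normalize to "similarity", so they match.
dim1 = {(h, DIM.get(r, r), t) for h, r, t in source1}
dim2 = {(h, DIM.get(r, r), t) for h, r, t in source2}
dimension_overlap = jaccard(dim1, dim2)
```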

Next, we inspect the overlap between these sources per dimension: Will the edges that correspond to more commonly found dimensions (e.g., part-whole, see Experiment 1) occur more often in multiple sources? We provide insight into this question in Table 6. Primarily, this Table reveals that there is very little edge overlap across sources. As hypothesized, most of the shared edges belong to dimensions that are common in many commonsense sources, such as taxonomic, similarity, and part-whole. The highest Jaccard score is obtained on the taxonomic knowledge between ConceptNet and WordNet, followed by similarity knowledge in ConceptNet-Roget and Roget-WordNet.
Wikidata and ConceptNet share edges that belong to a number of other dimensions, including distinctness, similarity, and rel-other.

The sparse overlap in Tables 5 and 6 is amplified by our lexical method of computing overlap, as the same or similar nodes may have slightly different labels. Both the low overlap and the relatively weak comparison method strongly motivate future work on node resolution of commonsense KGs.

5.3 Experiment 3: How do the dimensions of the edges compare to their language model encoding?

Next, we investigate how the information captured by our dimensions relates to the encoding of edges by state-of-the-art Transformer-based language models, like BERT or RoBERTa.9 For this purpose, we cluster the knowledge in CSKG according to our 13 dimensions, resulting in 13 disjoint clusters. We also compute clusters based on language models in an unsupervised manner as follows. Each of the edges is lexicalized into a natural language sentence by relation-specific templates. Each sentence is then encoded with a Transformer model, either BERT-large or RoBERTa-large, into a single 1,024-dimensional embedding. These embeddings are finally clustered with the k-Means [23] algorithm into disjoint clusters.

The two approaches for computing clusters, based on our dimensions and based on Transformer embeddings, can now be compared in terms of their agreement. We use the adjusted Rand index (ARI) to measure the agreement.10 The ARI score is 0.226 for BERT and 0.235 for RoBERTa. These scores signal low agreement between the dimension-based and the unsupervised clustering, which is expected given that the dimension-based clustering depends entirely on the relation, while the unsupervised clustering considers the entire triple. We also observe that the ARI score of RoBERTa is slightly higher, which might indicate that the relation has a higher impact on the embedding in RoBERTa than in BERT.
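Assuming both per-edge labellings are available, the agreement measurement is a single call to scikit-learn's adjusted_rand_score; the labels below are toy values rather than our actual dimension and k-means assignments:

```python
from sklearn.metrics import adjusted_rand_score

# Toy labels: a dimension label and an unsupervised cluster id per edge.
# In the paper, the cluster ids come from k-means over BERT/RoBERTa
# embeddings of the lexicalized edges.
dimension_labels = ["similarity", "similarity", "distinctness", "temporal"]
kmeans_clusters = [0, 0, 1, 1]

# ARI is 1.0 for identical partitions and close to 0.0 for random ones.
ari = adjusted_rand_score(dimension_labels, kmeans_clusters)
```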

Figure 1: UMAP clusters of RoBERTa.

To understand the information encoded by RoBERTa, we pick a random sample of 5,000 edges and visualize its k-means clusters with UMAP (Figure 1).
Curiously, certain clusters are clearly delineated, while others are not. For instance, cluster 5 has little overlap with the other clusters. Looking into the contents of this cluster, we observe that it is largely dominated by distinctness information: 92% (360 out of 390) of its edges belong to this dimension, mostly expressed through the /r/Antonym relation. Clusters 4, 7, and 8 are largely dominated by similarity, while clusters 1 and 6 are largely split between temporal (46%) and desire/goal (36%) edges. At the same time, we observe a lot of overlap between clusters 0, 2, 9, 10, 11, and 12. These clusters are dominated by lexical and relational-other edges – e.g., around half of all edges in clusters 0 and 9 belong to the category relational-other. The node frequency distributions reveal that cluster 1 describes positive emotions, as its most frequent node is /c/en/happy; nodes in cluster 5 are often numbers, like rg:en_twenty-eighth; and the nodes in cluster 9 describe concepts from natural sciences, the most connected node being /c/en/zoology.

Figure 2: UMAP clusters according to our dimensions.
RoBERTa cluster Dimension cluster Jaccard
5 distinctness 0.916
8 similarity 0.452
6 temporal 0.322
7 similarity 0.295
6 desire/goal 0.283
1 desire/goal 0.258
4 similarity 0.210
0 lexical 0.205
1 temporal 0.202
12 relational-other 0.182
Table 7: 10 pairs with the highest Jaccard scores between dimension-based and RoBERTa-based clusters.
0 lexical (0.205), rel-other (0.121), taxonomic (0.060)
1 desire (0.258), temporal (0.202), quality (0.087)
2 lexical (0.133), rel-other (0.122), taxonomic (0.075)
3 spatial (0.119), lexical (0.061), quality (0.053)
4 similarity (0.21), quality (0.015), lexical (0.008)
5 distinctness (0.916), lexical (0.018), taxonomic (0.003)
6 temporal (0.322), desire (0.283), quality, (0.054)
7 similarity (0.295), quality (0.009), taxonomic (0.005)
8 similarity (0.452), lexical (0.004), taxonomic (0.003)
9 rel-other (0.143), lexical (0.085), taxonomic (0.081)
10 rel-other (0.169), taxonomic (0.015), quality (0.007)
11 rel-other (0.11), quality (0.068), taxonomic (0.066)
12 rel-other (0.183), taxonomic (0.059), spatial (0.053)
Table 8: Top-3 highest-scored dimensions for each of the automatically-computed clusters. Bold-faced results indicate the top score for each dimension.

In Figure 2, we visualize the same set of edges, only this time we color each edge according to its dimension. In accordance with the relatively low adjusted Rand index, we observe that the clusters are mostly not well-distinguished from one another. We look for correspondences between the RoBERTa clusters in Figure 1 and the dimension-based clusters in Figure 2 by computing a Jaccard score between the edges that constitute each pair of their clusters. The 10 cluster pairs with the highest Jaccard scores are shown in Table 7. We observe the highest correspondence for distinctness with cluster 5 (Jaccard score of 0.92) and similarity with cluster 8 (score 0.45). The overlapping clusters for desire and temporal knowledge both map relatively strongly to clusters 1 and 6, which appear nearby in the RoBERTa clustering as well. We also observe relatively high scores for the cluster pairs 4-similarity, 7-similarity, and 0-lexical, all of which confirm our prior analysis of Figure 1.

We show the top-3 highest-scored dimensions for each of the automatically-computed clusters in Table 8. Here, we observe that some dimensions, like creation, utility, and part-whole do not fall within the top-3 dimensions for any of the RoBERTa clusters. This is likely due to the lower number of edges with these dimensions, as well as their dispersion across many RoBERTa clusters.

Figure 3: UMAP clusters for two selected nodes: /c/en/food and /c/en/eat.

To investigate further, we select the two CSKG nodes we considered earlier in Table 3: /c/en/food and /c/en/eat, and visualize all their edges according to their dimension. The total set of 2,661 edges largely belongs to the dimension relational-other (1,553), followed by temporal (319 edges), and desire/goal (228 edges), whereas no edge belongs to the creation dimension. Thus, most of the well-specified edges about nutritional concepts express either temporal information about the process, or knowledge about desires/goals relating to nutrition.
Within a single dimension, most relational-other edges are expressed with RelatedTo (1,488 out of 1,553). Temporal knowledge is split into multiple relations, primarily HasLastSubevent (113 edges), HasPrerequisite (69), and HasSubevent (60). Desire/goal is divided into at:xWant (78 edges), MotivatedByGoal (47), and at:xIntent (47 edges).
Besides the two seed nodes (/c/en/food and /c/en/eat), the frequency distribution of nodes in a cluster reveals other prominent nodes. Naturally, the spatial dimension includes /c/en/plate (with an edge degree of 3), the temporal cluster includes /c/en/diminish_own_hunger (degree of 4), and the distinctness cluster has 7 edges for /c/en/drink.

5.4 Experiment 4: What is the impact of each dimension for reasoning on QA tasks?

Experiment 3 revealed overlaps between the information captured by our dimensions and that captured by language models. In our fourth experiment, we enhance language models with knowledge belonging to individual dimensions, in order to examine the effect of different dimensions of knowledge on commonsense reasoning tasks.

We adopt the method proposed by [38] to pretrain state-of-the-art language models and conduct zero-shot evaluation on two commonsense question answering tasks. According to this method, we first transform ConceptNet, WordNet, Wikidata, ATOMIC, and Visual Genome into synthetic QA sets. We use templates to map each triple into a QA pair and apply random sampling with heuristic-based filtering to collect two distractors for every QA pair. We group the synthetic QA pairs based on their dimension, resulting in 12 dimension-based QA buckets in total. Within each dimension, the QA data is split into training and development sets. For ATOMIC, we adopt its original split to partition the data. For the other knowledge graphs, we partition 95% of the data into the training set and the remaining 5% into the development set, following [38]. It is worth noting that [38] only selected 14 relations from ConceptNet, WordNet, Wikidata, and Visual Genome, whereas we include all relations except RelatedTo. The statistics for the synthetic QA sets are shown in Table 9. We can see that the distribution of knowledge across dimensions is fairly skewed: creation has very few questions, while taxonomic and temporal knowledge are the most numerous. Our experiments in this section will reveal whether the amount of available knowledge affects downstream task performance.
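The synthesis step can be sketched as follows; the question template, toy triples, and the filtering (here simply excluding the gold tail) are illustrative stand-ins for the actual templates and heuristics of [38]:

```python
import random

# Hypothetical template; the real templates are relation-specific wordings
# from [38], not this exact phrasing.
TEMPLATES = {"/r/UsedFor": "What is {head} used for?"}

def triple_to_qa(triple, all_triples, rng):
    """Turn one triple into a 3-way multiple-choice QA pair."""
    head, rel, tail = triple
    # Distractors: tails of other triples with the same relation.
    distractor_pool = [t for h, r, t in all_triples if r == rel and t != tail]
    distractors = rng.sample(distractor_pool, 2)
    return {
        "question": TEMPLATES[rel].format(head=head),
        "answer": tail,
        "choices": sorted([tail] + distractors),
    }

triples = [
    ("knife", "/r/UsedFor", "cutting"),
    ("plate", "/r/UsedFor", "serving food"),
    ("bed", "/r/UsedFor", "sleeping"),
    ("pen", "/r/UsedFor", "writing"),
]
qa = triple_to_qa(triples[0], triples, random.Random(0))
```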

Dimensions Train Dev
part-whole 87,765 4,620
taxonomic 340,609 17,927
lexical 107,861 5,677
distinctness 20,286 1,068
similarity 166,575 8,768
quality 116,593 12,492
utility 63,862 3,362
creation 304 17
temporal 312,628 31,587
relational-other 242,759 12,777
spatial 21,726 1,144
desire/goal 194,906 20,912
Table 9: Statistics of the number of QA pairs for each dimension.

We pretrain the RoBERTa-large [36] model on each of the dimensions using the corresponding synthetic QA set. We use RoBERTa with a marginal ranking objective, as this is the best combination according to [38].
We use the same set of hyper-parameters (number of epochs, learning rate, batch size, and margin) as in [38], except for the creation dimension: since its number of samples is much smaller, we train the model for more epochs while keeping the other hyper-parameters fixed.
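For intuition, the marginal ranking objective can be written in a few lines of plain Python; the scores and margin below are illustrative values, not the model's actual answer scores or the hyper-parameter values of [38]:

```python
def margin_ranking_loss(correct_score, distractor_scores, margin=1.0):
    """Penalize distractors whose score comes within `margin` of the
    correct answer's score; well-separated distractors contribute zero."""
    return sum(
        max(0.0, margin - correct_score + d) for d in distractor_scores
    ) / len(distractor_scores)

# The first distractor is well separated (no loss); the second is too close
# to the correct answer and contributes 0.7, giving an average of 0.35.
loss = margin_ranking_loss(2.5, [0.5, 2.2])
```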
We evaluate our models on two tasks: the CommonsenseQA task [59], in which the model is asked to choose the correct answer from five options given only the question, and SocialIQA task [53], in which the model chooses the correct answer from three options given a question and a brief context.

The results from our experiments are shown in Table 10.
Overall, we can see that with the additional pretraining on transformed knowledge graphs, the models are able to outperform the no-knowledge baseline on all dimensions. However, the variance of the improvement across dimensions is relatively large, revealing that certain dimensions are more relevant for downstream tasks than others. For example, although the training set size of the lexical dimension exceeds 107K, its performance gain on both tasks is limited. We think that this is because the language model has already learned most of the lexical knowledge from pretraining on unstructured text corpora, and thus could not benefit much from additional training. While the quality dimension has a similar training set size as the lexical dimension, the model benefits from it by a large margin on both tasks: 20.7 and 12.7 absolute points, respectively. This finding suggests that quality knowledge is novel and useful, as it may not be easy to learn from unstructured text.

Also, we note that downstream tasks benefit more from the knowledge dimensions that align with their question types. For example, each question in SIQA corresponds to an ATOMIC relation, and requires knowledge primarily about the order of events, personal attributes, and agent desires. Consequently, pretraining on quality, temporal, and desire/goal knowledge provides the model with the largest gain on the SIQA task. The results of the temporal dimension are even higher than training on the entire set of questions, suggesting that certain knowledge dimensions that are not related to SIQA may even lead to a decline in model performance. For the CSQA task, since it is derived from the broad set of knowledge dimensions covered in ConceptNet, we expect that many (if not all) of these dimensions would help performance. Accordingly, we observe large gains with many of the knowledge dimensions (+15%), with the utility dimension yielding the best performance (+22.4%), even slightly better than that of the entire set.

Dimensions CSQA SIQA
Baseline 45.0 47.3
Table 10: Results of zero-shot evaluation on two commonsense reasoning tasks. We run every experiment 3 times with different seeds, and report mean accuracy with a 95% confidence interval.

To better understand the impact of every dimension of knowledge on different questions, we further break down the performance of the models by question type. Specifically, for CSQA, we classify questions based on the ConceptNet relation between the correct answer and the question concept, as the model needs to reason over this relation to answer the question. For SIQA, since the questions were initially generated using a set of templates based on ATOMIC relations, we reverse-engineer the process by manually defining a mapping from question format to ATOMIC relations. Using this method, we are able to classify more than 99% of the questions in the SIQA dev set. Then we compute the average accuracy over 3 seeds for every question type for models trained on each dimension. The results for CSQA are shown in Figure 4 and the results for SIQA in Figure 5.11
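The reverse-engineered SIQA mapping can be sketched with simple patterns; the question formats and patterns below are hypothetical examples, not our actual manually defined template set:

```python
import re

# Hypothetical question-format patterns mapped to ATOMIC relations.
PATTERNS = [
    (re.compile(r"What will \w+ want to do next\?"), "xWant"),
    (re.compile(r"Why did \w+ do this\?"), "xIntent"),
    (re.compile(r"How would you describe \w+\?"), "xAttr"),
]

def classify_question(question):
    """Return the ATOMIC relation for the first matching pattern, else None."""
    for pattern, relation in PATTERNS:
        if pattern.search(question):
            return relation
    return None

relation = classify_question("What will Alex want to do next?")
```

Unmatched questions return None; in our study, fewer than 1% of the SIQA dev questions fall outside the mapping.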

For some question types of CSQA, (one of) the largest improvements is achieved on the corresponding knowledge dimension, for example, temporal on Causes and desire/goal on Desires. However, in other cases, the accuracy boost brought about by the corresponding knowledge dimension is significantly lower than that of other dimensions, for example, distinctness on Antonym compared to utility. This might signal that the knowledge represented within these dimensions is not clearly separated. For SIQA, the results show that the corresponding knowledge dimension more clearly helps for most question types: desire/goal on xWant and xIntent, quality on xAttr, and temporal on xNeed, oReact, and oEffect. This is especially visible for xIntent and xNeed, where very little gain is observed for most knowledge dimensions except the corresponding one, suggesting that the alignment between the questions and knowledge dimensions is important for the model’s success. We note that a similar finding on the alignment between knowledge and the task has been reported in the original paper [38]; yet, the dimensions allow us to validate this claim more precisely.

Figure 4: Accuracy for each question type in CSQA, where AtL. means AtLocation, Cau. means Causes, Cap. means CapableOf, Ant. means Antonym, H.Pre. means HasPrerequisite, H.Sub. means HasSubevent, C.Des. means CauseDesires, Des. means Desires, P.Of means PartOf, M.Goal means MotivatedByGoal, and H.Pro means HasProperty. The numbers in parentheses indicate how many questions fall into each category.
Figure 5: Accuracy for each question type in SIQA. The numbers in parentheses indicate how many questions fall into that category.

Finally, to verify our hypothesis that certain dimensions of knowledge are already learned by the state-of-the-art language models to a large extent, while others are not, we directly evaluate the baseline LM on the synthetic QA sets for each dimension. The results are shown in Table 11. As expected, even without any training, the model already achieves a very high accuracy (over 90%) on the lexical dimension, and thus could not receive much training signal from this dimension. On the other hand, the accuracies on the quality, temporal, and desire/goal dimensions are significantly lower. This is mostly because questions from ATOMIC make up a large portion of these dimensions. As reported in [38], the questions from ATOMIC are more challenging than those created from other knowledge resources. We note that the accuracy for relational-other is the lowest among all dimensions. We hypothesize that this is because this dimension is noisier than the others and its knowledge is less likely to be found in unstructured text. We leave further investigation of this issue for future research.

In summary, we observe that certain dimensions of knowledge are very beneficial and novel for language models, allowing them to improve their performance on downstream reasoning tasks. Other dimensions, like lexical knowledge, are almost entirely redundant, as the language models have already acquired this knowledge during their initial training. The exact contribution for each dimension depends on the knowledge required by the task at hand. We discuss the implications of the obtained results further in Section 7.

Dimensions Dev
part-whole 67.5
taxonomic 57.0
lexical 90.1
distinctness 77.3
similarity 65.6
quality 45.5
utility 67.9
creation 82.4
temporal 47.2
relational-other 37.6
spatial 56.9
desire/goal 48.0
Table 11: Results of zero-shot evaluation of RoBERTa on the synthetic QA sets.

6 Related Work

6.1 Epistemic Foundations

World knowledge is what people learn and generalize from physical and social experiences, distilling mental representations out of the most significant aspects of everyday life.12 In these terms, common sense can be conceived as the partition of world knowledge that is commonly shared by most people. This definition, however, has intrinsic limitations: in fact, the scale and the diversity of physical and social experiences, as well as the ecological, context-dependent nature of what constitutes ‘common’ and ‘uncommon’ [20], make it hard to formulate any abstract criterion of what should fall under common sense knowledge.
From Aristotle’s theory of categories [1] to Brentano’s empirical psychology [9], deriving knowledge dimensions from empirical observations, as opposed to from abstract criteria [25], is a fundamental epistemic approach that contributed to the birth of Cognitive Science as a discipline [42], besides serving as reference framework for our current investigation.
In this article, in fact, we neither propose nor adopt any a priori principle to define what should be included in a common sense knowledge graph; rather, we analyze multiple knowledge graphs and, supported by empirical methods and experimental validations, elicit their most salient conceptual structures.

In the history of general knowledge base systems, the difficulty of characterizing commonsense has been a driver, rather than an obstacle. For instance, Cyc [16], the most monumental effort to construct an axiomatic theory of common sense knowledge, has been actively growing for almost forty years. At present, the Cyc knowledge base comprises around 1.5 million general concepts, and 25 million rules and assertions about these general concepts. Different domain-specific extensions of Cyc exist, funded by industry and government programs: considering the key role that commonsense knowledge can play in enhancing AI systems for private and public enterprises, Cyc’s strategy of steering the general knowledge base development in the direction of domain use cases can represent a sustainable business model for other stakeholders in the field.13
Existing commonsense knowledge graphs make implicit categorizations of knowledge, by defining a tractable set of relations which can be traced to some types proposed in cognitive research. For instance, WebChild’s [60] part-of relations resemble the partonomic-meronymic relations in cognitive science literature (e.g., see [11, 64]), while ConceptNet [58] defines 34 relations, where the relation IsA can often be approximated with taxonomic knowledge. In its first version [35], ConceptNet defined 20 relations grouped into 8 categories: K-lines, Things, Agents, Events, Spatial, Causal, Functional, and Affective. Zhang et al. [67] extrapolate the types in the Conceptual Semantic Theory with those in ConceptNet 1.0 and propose the following six categories: property, object, eventuality, spatial, quantity, and others.

Beyond the structural differences, commonsense knowledge graphs share the same foundational elements. Commonsense knowledge is generally split into declarative and procedural, where the former is contained in unconditional assertions, and the latter requires conditional assertions14: we can state, for instance, that windows are typically made of glass (declarative), and assert that if a large rock is thrown against a window, the glass typically breaks (procedural). As the use of the adverb typically suggests, commonsense knowledge rules out exceptions from the context of interpretation: for instance, bulletproof glass doesn’t break when hit by a rock. According to [39], four types of contextual knowledge are essential for humans to interpret or frame text: intratextual, intertextual, extratextual, and circumtextual knowledge. These generic knowledge types, which are orthogonal to the declarative/procedural distinction, are relevant for most natural language understanding tasks. In [28], we analyzed these types for the task of entity linking; when it comes to commonsense question answering, they might provide a guide for extending the coverage of the knowledge, when combined with specific theories/axioms, such as those defined by the resources in section 3.2 or in Cyc.

6.2 Consolidation efforts

In this paper, we analyzed individual knowledge graphs through suitable semantic dimensions, with the goal of providing insights on how the consolidation of commonsense knowledge resources can be guided and, eventually, achieved. A natural extension of our work would be to evaluate ongoing efforts that adopt alternative methods of consolidation: Framester [17], BabelNet [44], CSKG [27], and Predicate Matrix [13] constitute some of the most mature projects in this space. In Framester, several resources, like WordNet, VerbNet, FrameNet, and BabelNet, are aligned using an OWL schema based on the Descriptions and Situations and Semiotics ontology design patterns.15 CSKG, which we leverage in this paper, is also based on a schema, but it does not rely on traditional RDF/OWL semantics: in fact, CSKG is a hyper-relational graph represented in a tabular format, designed to preserve the individual knowledge structures of resources like ConceptNet, WebChild, and Visual Genome, exploit direct mappings when available, derive indirect mappings when possible (e.g., while ConceptNet and Visual Genome have no direct connections, they both have mappings to WordNet), and infer links through statistical algorithms. BabelNet is a multilingual lexicalized semantic network based on automatically linking Wikipedia with WordNet, expanded with additional information from resources like FrameNet and VerbNet. Finally, Predicate Matrix exploits word sense disambiguation algorithms to generate semi-automatic mappings among FrameNet, VerbNet, PropBank, WordNet, and ESO [55].16

7 Discussion and Roadmap

7.1 Summary of Findings

Commonsense knowledge sources use different levels of semantics, come in a variety of forms, and strive to capture diverse notions of common sense. After surveying 20 commonsense knowledge sources, we proposed that their relations can be grouped into 13 dimensions, namely: lexical, similarity, distinctness, taxonomic, part-whole, spatial, creation, utility, desire/goal, quality, comparative, temporal, and relational-other. Most relations can be unambiguously mapped to one of these dimensions.
We apply our dimensions to reorganize the knowledge in the Commonsense Knowledge Graph (CSKG) [27]. Following our mapping of relations to dimensions, we add a column (relation;dimension) to CSKG, indicating the dimension of each edge. This allows us to reuse the consolidation of seven existing sources done by CSKG and to complement it with the dimensions in order to perform a more abstract analysis of its knowledge types.
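As a minimal sketch of this annotation step (the relation names and mapping entries below are a small illustrative subset, not our full mapping; the actual script is linked in footnote 6):

```python
import csv
import io

# Illustrative fragment of the relation-to-dimension mapping; the full
# mapping covers all relations of the seven consolidated sources.
RELATION_TO_DIMENSION = {
    "/r/PartOf": "part-whole",
    "/r/AtLocation": "spatial",
    "/r/UsedFor": "utility",
    "/r/Desires": "desire/goal",
    "/r/Synonym": "similarity",
}

def annotate_edges(tsv_text):
    """Add a 'relation;dimension' column to a CSKG-style edge table."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    rows = []
    for row in reader:
        # Unmapped relations fall into the catch-all dimension.
        row["relation;dimension"] = RELATION_TO_DIMENSION.get(
            row["relation"], "relational-other")
        rows.append(row)
    return rows

edges = "node1\trelation\tnode2\n/c/en/wheel\t/r/PartOf\t/c/en/car\n"
annotated = annotate_edges(edges)
# annotated[0]["relation;dimension"] == "part-whole"
```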
We designed and ran four experiments to analyze commonsense knowledge in CSKG through the lenses of these 13 dimensions.

In experiment 1, we investigated the coverage of the 13 dimensions in current sources. Some dimensions, like part-whole and similarity, are of interest in most sources. Others, like comparative knowledge and knowledge on desires/goals, are rarely captured. Still, the depth of knowledge on the less commonly represented dimensions can be high, as illustrated by the 244 thousand desire/goal edges in ATOMIC. Here we also observed that the breadth of focus varies notably across sources: some (e.g., ConceptNet and Wikidata-CS) cover a wide range of relations, while others (e.g., ATOMIC or WordNet) have a narrower focus.

Experiment 2 posed the question of whether individual knowledge statements are redundant across sources. Our experiments with four sources indicated that, with few exceptions, only a tiny portion of all edges is shared between any pair of sources. This experiment points to a two-fold motivation for node resolution over commonsense sources. On the one hand, node resolution is needed to compute overlap more reliably than the current lexical comparison of nodes allows. On the other hand, as the sources have largely complementary goals, the overlap is likely to remain low even with a more semantic computation. Node resolution is, thus, essential for consolidating different views of a node (concept) into a single representation.
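The lexical overlap computation can be sketched as follows (the edges below are toy examples; the real comparison uses node labels as described in footnote 7):

```python
def normalize(edge):
    """Reduce an edge to a lexical key: lowercased node labels plus the dimension."""
    head, dim, tail = edge
    return (head.lower().strip(), dim, tail.lower().strip())

def overlap(source_a, source_b):
    """Fraction of the smaller source's edges that also occur, lexically,
    in the other source."""
    a = {normalize(e) for e in source_a}
    b = {normalize(e) for e in source_b}
    return len(a & b) / min(len(a), len(b))

# Toy edge sets standing in for two sources:
source1 = [("wheel", "part-whole", "car"), ("glass", "part-whole", "window")]
source2 = [("Wheel", "part-whole", "Car"), ("lens", "part-whole", "camera")]
# overlap(source1, source2) -> 0.5, since one of two edges matches lexically
```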

In experiment 3, we clustered all edges in CSKG according to their dimension, and compared these clusters with clusters based on a language model encoding of each edge. We noted that the overall agreement between the dimensions and the language model-based clustering is relatively low, indicating that language models pay much attention to the edge nodes. However, individual correspondences were noted: similarity and distinctness quite clearly dominated some of the RoBERTa-based clusters, while other clusters were consistently split between the desire/goal and temporal dimensions. Interestingly, the clusters inferred from the RoBERTa embeddings often grouped nodes from different sources into a single cluster.
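The agreement between the two clusterings can be quantified with the adjusted Rand index; in practice we use scikit-learn's adjusted_rand_score (footnote 10), but a small self-contained equivalent illustrates the computation:

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two labelings of the same items
    (equivalent in spirit to sklearn.metrics.adjusted_rand_score)."""
    n = len(labels_a)
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case, e.g., a single cluster
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Toy example: dimension labels vs. unsupervised cluster ids for four edges.
dims = ["similarity", "similarity", "temporal", "temporal"]
clusters = [0, 0, 1, 1]   # perfect correspondence -> ARI of 1.0
```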

Finally, in experiment 4 we investigated the impact of the dimensions on the downstream reasoning task of commonsense question answering. We adopted a recent idea [38] for pretraining language models with knowledge graphs. The best-scoring model in that paper, RoBERTa-large with marginal ranking loss, was fed with knowledge from one of our dimensions at a time, and evaluated in a zero-shot manner on two benchmarks testing broad (CommonsenseQA) and social (SocialIQA) commonsense reasoning. The experiments showed that social commonsense reasoning clearly benefits from temporal, quality, and desire/goal knowledge, whereas the CommonsenseQA benchmark benefits from broad knowledge from all dimensions. Certain dimensions, such as lexical knowledge, were relatively uninformative, as such knowledge has presumably already been acquired by the language models during their initial training. While the amount of knowledge plays a role, adding more knowledge is not always beneficial, as task performance depends on the alignment between the dimensions and the task. This motivates further work on automatic alignment between task questions and our dimensions, as well as future work that evaluates whether adding content for particular dimensions improves performance on particular kinds of tasks.
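For illustration, the marginal ranking objective used by [38] can be sketched as follows; the plausibility scores below are synthetic, whereas in the actual method they are produced by RoBERTa-large over the answer candidates:

```python
def marginal_ranking_loss(score_correct, scores_distractors, margin=1.0):
    """Hinge-style loss that pushes the correct answer's score above
    each distractor's score by at least `margin`."""
    return sum(max(0.0, margin - score_correct + s)
               for s in scores_distractors)

# Synthetic plausibility scores (higher = more plausible to the model).
# The first distractor is already separated by more than the margin,
# so only the second one contributes to the loss.
loss = marginal_ranking_loss(score_correct=2.5, scores_distractors=[0.3, 1.9])
```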

7.2 Outlook

The goal of consolidating and applying commonsense knowledge is an ambitious one, as witnessed by decades of research on this topic. Our 13 dimensions are an effort to reorganize existing commonsense knowledge through unification of its knowledge types. We see this as a necessary, but not sufficient, step towards a modern and comprehensive commonsense resource. The dimensions could facilitate, or complement, several other aspects of this pursuit:

1. Node resolution While potentially controversial, the consolidation of commonsense knowledge relations into dimensions/knowledge types is achievable with careful manual effort, largely owing to the relatively small number of relations in most current sources. Resolving nodes across sources is another key aspect of this consolidation, strongly motivated by experiment 2 of this paper. Sources capture complementary knowledge, whose combination is prevented by the lack of mappings between their nodes. As nodes currently represent various aspects of meaning (words, phrases, concepts, frames, events, sentences), their consolidation is not obvious. Moreover, the number of unique nodes in most sources is on the order of many thousands or even millions [27], preventing node resolution from being solved by manual effort alone. Node resolution could be framed as a ‘static’ disambiguation/clustering task, where each statement for a node is to be classified into one of its meanings, similar to [12]. Here, the set of meanings can either be given in advance (e.g., WordNet synsets) or be dynamically inferred from the data. An alternative, ‘dynamic’ approach is to defer node resolution to task time and perform it implicitly as a task of retrieving evidence from a knowledge source [32]. Another option is a combination of the static and dynamic approaches.

2. Coverage and boundaries
At present, it is difficult to estimate the completeness of commonsense knowledge sources. With the relations organized into dimensions of knowledge, we gain insight into the volume of knowledge that falls within each dimension. An ideal node resolution would take us one step further, allowing us to detect gaps, i.e., to understand which relevant facts are not represented by any of the sources. If nodes are resolved to an ontology like WordNet, one could leverage its taxonomy to infer new information. For instance, ConceptNet is at present unable to infer that if barbecues are held in outdoor places, they could, by extension, be held in a park or on someone’s patio. In addition, a more semantic resource would allow us to define constraints over the knowledge and to detect anomalies and contradictory knowledge, which can be argued to define the boundaries of the knowledge that can be obtained. It is reasonable to assume that such boundaries exist, as commonsense knowledge is characterized by the commonness of its concepts and a restricted set of relations [26].
Further, organizing relations by dimensions allows us to describe the strengths (and weaknesses) of resources. For example, a resource with many partonomic relationships might be the first resource to consider if a task requires part-whole reasoning.
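The taxonomy-based inference described above (the barbecue example) can be sketched minimally; the IsA hierarchy and labels below are hypothetical toy data:

```python
# Toy taxonomy: hyponym -> hypernym (hypothetical labels).
IS_A = {
    "park": "outdoor_place",
    "patio": "outdoor_place",
    "kitchen": "indoor_place",
}

# Known assertion: barbecues are held in outdoor places.
LOCATED_AT = {("barbecue", "outdoor_place")}

def plausible_locations(event, candidates):
    """Infer that an event may occur at any candidate place whose
    hypernym is a known location of the event."""
    return [c for c in candidates
            if (event, c) in LOCATED_AT
            or (event, IS_A.get(c)) in LOCATED_AT]

inferred = plausible_locations("barbecue", ["park", "patio", "kitchen"])
# -> ['park', 'patio']
```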

3. Generalizable downstream reasoning As current large-scale commonsense sources are primarily text-based, they are lexicalized prior to their combination with language models, losing much of their structure. As this lack of structure prevents us from understanding their coverage and gaps, we are unable to measure their potential for downstream reasoning as a function of the available knowledge. It remains unknown to what extent a more complete source, organized around dimensions of commonsense knowledge, would contribute to improved performance. Experiment 4 showed that there is correspondence between knowledge dimensions and question answering tasks, motivating automatic alignment between the two. Moreover, a comprehensive semantic source may inspire new neuro-symbolic reasoning methods, with potentially enhanced generalizability and explainability, opening the door for reliable commonsense services to be made available in the future.

4. Evaluation and knowledge gaps Experiment 4 showed that the potential of different dimensions for reasoning varies greatly and is largely dependent on the evaluation data. This finding is in line with [38]. The fact that certain dimensions consistently contribute little can be an indicator of gaps in current evaluation. Namely, dimensions like distinctness and spatial, which currently contribute little or not at all, are likely to be underrepresented in current evaluations. These gaps should ideally be addressed by new benchmarks that represent the missing dimensions. We note that our set of dimensions is based on the relations found in current popular commonsense sources. Hence, in this paper, we assume that the knowledge types in these sources suffice, or at least have previously sufficed, to express the desired knowledge. The diversity of knowledge expressed by the relational-other dimension, as also pointed out in [26], might be an indicator of additional, latent dimensions hidden behind the vagueness of this dimension.

8 Conclusions

At present, commonsense knowledge is dispersed across a variety of sources with different foci, strengths, and weaknesses. The complementary knowledge covered by these sources motivates efforts to consolidate them under a common representation. In this paper, we pursued the goal of organizing commonsense relations into a shared set of knowledge dimensions in a bottom-up fashion. Starting from a survey and analysis of the relations found in existing sources, we grouped them into 13 dimensions: lexical, similarity, distinctness, taxonomic, part-whole, spatial, creation, utility, desire/goal, quality, comparative, temporal, and relational-other. As each relation in these sources can be mapped to a dimension, we applied our method to abstract the relations in an existing consolidated resource: the Commonsense Knowledge Graph (CSKG). This allowed us to empirically study the impact of these dimensions. First, we observed that some dimensions are included more often than others, potentially pointing to gaps in the knowledge covered by existing resources. Second, we measured sparse overlap of facts expressed with each dimension across sources, which motivates future work on graph integration through (automated) node resolution. Third, comparing the dimension-based clustering to language model-based unsupervised edge clustering resulted in low overall agreement, though in some cases the unsupervised clusters were dominated by one or two dimensions. This showed that some of the dimensions represent a stronger signal for language modeling than others. Fourth, we measured the impact of each dimension on a downstream question answering reasoning task, by adapting a state-of-the-art method of pretraining language models with knowledge graphs. Here, we observed that the impact differs greatly per dimension, depending largely on the alignment between the task and the knowledge dimension, as well as on the novelty of the knowledge captured by a dimension.
While this is in accordance with the findings of the original method [38], the dimension-driven experiments of this paper enabled this hypothesis to be investigated much more precisely, revealing the direct impact of each knowledge dimension rather than entire knowledge sources.

Our experiments inspired a four-step roadmap towards the creation and utilization of a comprehensive dimension-centered resource. (1) Node resolution methods should be introduced and applied to unify the resources further. (2) Such an integration would allow us to better understand and improve the coverage, gaps, and boundaries of these sources. (3) A large-scale, public semantic graph of commonsense knowledge may inspire novel neuro-symbolic methods, potentially allowing for better generalization and explainability. (4) The impact of a dimension is an indicator of the coverage of that dimension in current evaluation benchmarks; under-represented dimensions are evaluation gaps that may need to be filled by introducing new benchmarks. And, vice versa, additional knowledge dimensions might be hidden behind the generic relational-other dimension.


This material is based upon work sponsored by the DARPA MCS program under Contract No. N660011924033 with the United States Office of Naval Research.


  1. This phenomenon can be seen on the benchmark leaderboards, which are dominated by ‘pure’ language models, for instance: https://leaderboard.allenai.org/socialiqa/submissions/public (accessed on January 5th, 2021).
  2. For brevity, we omit the word ‘digital’ in the remainder of this paper.
  3. Here we exclude implicitly comparative knowledge, such as the inferred information that eating food makes one more satisfied from the triple: PersonX eats food – xReact – satisfied.
  4. As discussed before, this assumption might not always hold in practice. Future work should attempt to refine this mapping, e.g., by crowdsourcing or by clustering algorithms.
  5. We leave out the relations prefixed with /r/dbpedia from ConceptNet, as these are being deprecated according to the official documentation: https://github.com/commonsense/conceptnet5/wiki/Relations.
  6. Python script: https://github.com/usc-isi-i2/cskg/blob/master/consolidation/compute_dimensions.py.
  7. If a node has more than one label, then we perform comparison based on the first one.
  8. Notebook: https://github.com/usc-isi-i2/cskg/blob/master/analysis/Overlap.ipynb.
  9. Notebook: https://github.com/usc-isi-i2/cskg/blob/master/embeddings/Summary%20of%20Dimension%20on%20CSKG.ipynb
  10. https://scikit-learn.org/stable/modules/generated/sklearn.metrics.adjusted_rand_score.html
  11. We omit question types with less than 20 questions for CSQA.
  12. According to the classic argument of the mind-body problem, it is inherently impossible to characterize how generalization occurs, due to an explanatory gap [31].
  13. Because our focus is on resources, it is beyond the scope of this paper to discuss seminal investigations on common sense axiomatization, such as Pat Hayes’ naive physics [24] and Ernest Davis’ work on qualitative commonsense reasoning.
  14. The distinction between these types of assertions was formalized in a seminal work by Gentzen [19].
  15. These patterns can be accessed at http://ontologydesignpatterns.org/wiki/Main_Page
  16. https://github.com/newsreader/eso-and-ceo


  1. Aristotle, R. B. Jones and W. D. Ross (2012). The metaphysics. CreateSpace Independent Publishing Platform. ISBN 9781478203391.
  2. C. F. Baker, C. J. Fillmore and J. B. Lowe (1998). The Berkeley FrameNet project. In 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1, pp. 86–90.
  3. P. Banerjee and C. Baral (2020). Self-supervised knowledge triplet learning for zero-shot question answering. arXiv preprint arXiv:2005.00316.
  4. S. Bhakthavatsalam, C. Anastasiades and P. Clark (2020). GenericsKB: a knowledge base of generic statements. arXiv preprint arXiv:2005.00660.
  5. S. Bhakthavatsalam, K. Richardson, N. Tandon and P. Clark (2020). Do dogs have whiskers? A new knowledge base of hasPart relations. arXiv preprint arXiv:2006.07510.
  6. Y. Bisk, R. Zellers, R. LeBras, J. Gao and Y. Choi (2020). PIQA: reasoning about physical commonsense in natural language. In AAAI, pp. 7432–7439.
  7. M. Boratko, X. L. Li, R. Das, T. O’Gorman, D. Le and A. McCallum (2020). ProtoQA: a question answering dataset for prototypical common-sense reasoning. arXiv preprint arXiv:2005.00771.
  8. A. Bosselut, H. Rashkin, M. Sap, C. Malaviya, A. Celikyilmaz and Y. Choi (2019). COMET: commonsense transformers for automatic knowledge graph construction. arXiv preprint arXiv:1906.05317.
  9. F. Brentano (2014). Psychology from an empirical standpoint.
  10. E. Cambria, Y. Li, F. Z. Xing, S. Poria and K. Kwok (2020). SenticNet 6: ensemble application of symbolic and subsymbolic AI for sentiment analysis. CIKM’20, Oct 20–24.
  11. R. Casati and A. C. Varzi (1999). Parts and places: the structures of spatial representation. MIT Press.
  12. J. Chen and J. Liu (2011). Combining ConceptNet and WordNet for word sense disambiguation. In Proceedings of 5th International Joint Conference on Natural Language Processing, pp. 686–694.
  13. M. L. de Lacalle, E. Laparra, I. Aldabe and G. Rigau (2016). A multilingual predicate matrix. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pp. 2662–2668.
  14. J. Devlin, M. Chang, K. Lee and K. Toutanova (2018). BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  15. E. K. Dodge, J. Hong and E. Stickles (2015). MetaNet: deep semantic automatic metaphor analysis. In Proceedings of the Third Workshop on Metaphor in NLP, pp. 40–49.
  16. C. Elkan and R. Greiner (1993). Building large knowledge-based systems: representation and inference in the Cyc project: D. B. Lenat and R. V. Guha.
  17. A. Gangemi, M. Alam, L. Asprino, V. Presutti and D. R. Recupero (2016). Framester: a wide coverage linguistic linked data hub. In European Knowledge Acquisition Workshop, pp. 239–254.
  18. A. Gangemi, N. Guarino, C. Masolo, A. Oltramari and L. Schneider (2002). Sweetening ontologies with DOLCE. In International Conference on Knowledge Engineering and Knowledge Management, pp. 166–181.
  19. G. Gentzen (1934). Investigations into logical deduction. Translation printed in M. Szabo, The Collected Papers of Gerhard Gentzen. Amsterdam: North-Holland.
  20. E. J. Gibson and A. D. Pick (2000). An ecological approach to perceptual learning and development. Oxford University Press, USA.
  21. H. P. Grice (1975). Logic and conversation. In Speech Acts, pp. 41–58.
  22. R. V. Guha, D. Brickley and S. Macbeth (2016). Schema.org: evolution of structured data on the web. Communications of the ACM 59(2), pp. 44–51.
  23. J. A. Hartigan and M. A. Wong (1979). Algorithm AS 136: a k-means clustering algorithm. Journal of the Royal Statistical Society, Series C (Applied Statistics) 28(1), pp. 100–108.
  24. P. J. Hayes (1979). The naive physics manifesto. Expert Systems in the Microelectronic Age.
  25. G. D. Hicks (1904). Idealism and the problem of knowledge and existence. In Proceedings of the Aristotelian Society, Vol. 5, pp. 136–178.
  26. F. Ilievski, P. Szekely and D. Schwabe (2020). Commonsense knowledge in Wikidata. arXiv preprint arXiv:2008.08114.
  27. F. Ilievski, P. Szekely and B. Zhang (2020). CSKG: the commonsense knowledge graph. arXiv preprint arXiv:2012.11490.
  28. F. Ilievski, P. Vossen and M. van Erp (2017). Hunger for contextual knowledge and a road map to intelligent entity linking. In International Conference on Language, Data and Knowledge, pp. 143–149.
  29. R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L. Li and D. A. Shamma (2017). Visual Genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1), pp. 32–73.
  30. D. B. Lenat (1995). CYC: a large-scale investment in knowledge infrastructure. Communications of the ACM 38(11), pp. 33–38.
  31. J. Levine (1983). Materialism and qualia: the explanatory gap. Pacific Philosophical Quarterly 64(4), pp. 354–361.
  32. P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih and T. Rocktäschel (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv preprint arXiv:2005.11401.
  33. B. Y. Lin, X. Chen, J. Chen and X. Ren (2019). KagNet: knowledge-aware graph networks for commonsense reasoning. arXiv preprint arXiv:1909.02151.
  34. B. Y. Lin, S. Lee, R. Khanna and X. Ren (2020). Birds have four legs?! NumerSense: probing numerical commonsense knowledge of pre-trained language models. arXiv preprint arXiv:2005.00683.
  35. H. Liu and P. Singh (2004). ConceptNet—a practical commonsense reasoning tool-kit. BT Technology Journal 22(4), pp. 211–226.
  36. Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer and V. Stoyanov (2019). RoBERTa: a robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
  37. K. Ma, J. Francis, Q. Lu, E. Nyberg and A. Oltramari (2019). Towards generalizable neuro-symbolic systems for commonsense question answering. arXiv preprint arXiv:1910.14087.
  38. K. Ma, F. Ilievski, J. Francis, Y. Bisk, E. Nyberg and A. Oltramari (2020). Knowledge-driven data construction for zero-shot evaluation in commonsense question answering. arXiv preprint arXiv:2011.03863.
  39. G. L. MacLachlan and I. Reid (1994). Framing and interpretation. Melbourne University Press.
  40. J. McCarthy (1960). Programs with common sense. RLE and MIT Computation Center.
  41. G. A. Miller (1998). WordNet: an electronic lexical database. MIT Press.
  42. G. A. Miller (2003). The cognitive revolution: a historical perspective. Trends in Cognitive Sciences 7(3), pp. 141–144.
  43. N. Mostafazadeh, A. Kalyanpur, L. Moon, D. Buchanan, L. Berkowitz, O. Biran and J. Chu-Carroll (2020). GLUCOSE: generalized and contextualized story explanations. arXiv preprint arXiv:2009.07758.
  44. R. Navigli and S. P. Ponzetto (2012). BabelNet: the automatic construction, evaluation and application of a wide-coverage multilingual semantic network. Artificial Intelligence 193, pp. 217–250.
  45. I. Niles and A. Pease (2001). Towards a standard upper ontology. In Proceedings of the International Conference on Formal Ontology in Information Systems, Volume 2001, pp. 2–9.
  46. T. Pellissier Tanon, G. Weikum and F. Suchanek (2020). YAGO 4: a reason-able knowledge base. The Semantic Web.
  47. F. Petroni, T. Rocktäschel, P. Lewis, A. Bakhtin, Y. Wu, A. H. Miller and S. Riedel (2019). Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
  48. B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier and S. Lazebnik (2016). Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models. International Journal of Computer Vision, pp. 1–20.
  49. A. Radford, J. Wu, R. Child, D. Luan, D. Amodei and I. Sutskever (2019). Language models are unsupervised multitask learners. OpenAI Blog 1(8), pp. 9.
  50. P. M. Roget (2020). Roget’s Thesaurus. Good Press.
  51. J. Romero, S. Razniewski, K. Pal, J. Z. Pan, A. Sakhadeo and G. Weikum (2019). Commonsense properties from query logs and question answering forums. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, pp. 1411–1420.
  52. M. Sap, R. Le Bras, E. Allaway, C. Bhagavatula, N. Lourie, H. Rashkin, B. Roof, N. A. Smith and Y. Choi (2019). ATOMIC: an atlas of machine commonsense for if-then reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3027–3035.
  53. M. Sap, H. Rashkin, D. Chen, R. Le Bras and Y. Choi (2019). Social IQa: commonsense reasoning about social interactions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4463–4473.
  54. K. K. Schuler (2005). VerbNet: a broad-coverage, comprehensive verb lexicon.
  55. R. Segers, P. Vossen, M. Rospocher, L. Serafini, E. Laparra and G. Rigau (2015). ESO: a frame based ontology for events and implied situations. Proceedings of MAPLEX 2015.
  56. V. Shwartz, P. West, R. L. Bras, C. Bhagavatula and Y. Choi (2020). Unsupervised commonsense question answering with self-talk. arXiv preprint arXiv:2004.05483.
  57. P. Singh, T. Lin, E. T. Mueller, G. Lim, T. Perkins and W. L. Zhu (2002). Open Mind Common Sense: knowledge acquisition from the general public. In OTM Confederated International Conferences “On the Move to Meaningful Internet Systems”, pp. 1223–1237.
  58. R. Speer, J. Chin and C. Havasi (2017). ConceptNet 5.5: an open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.
  59. A. Talmor, J. Herzig, N. Lourie and J. Berant (2019). CommonsenseQA: a question answering challenge targeting commonsense knowledge. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4149–4158.
  60. N. Tandon, G. De Melo and G. Weikum (2017). WebChild 2.0: fine-grained commonsense knowledge distillation. In Proceedings of ACL 2017, System Demonstrations, pp. 115–120.
  61. E. van Miltenburg (2016). Stereotyping and bias in the Flickr30k dataset. In Proceedings of Multimodal Corpora: Computer Vision and Language Processing (MMC 2016), J. Edlund, D. Heylen and P. Paggio (Eds.), pp. 1–4.
  62. D. Vrandečić and M. Krötzsch (2014). Wikidata: a free collaborative knowledgebase. Communications of the ACM 57(10), pp. 78–85.
  63. B. Williams, H. Lieberman and P. H. Winston (2017). Understanding stories with large-scale common sense.
  64. M. E. Winston, R. Chaffin and D. Herrmann (1987). A taxonomy of part-whole relations. Cognitive Science 11(4), pp. 417–444.
  65. W. Yang, X. Wang, A. Farhadi, A. Gupta and R. Mottaghi (2018). Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543.
  66. R. Zellers, Y. Bisk, A. Farhadi and Y. Choi (2019). From recognition to cognition: visual commonsense reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6720–6731.
  67. H. Zhang, X. Zhao and Y. Song (2020). WinoWhy: a deep diagnosis of essential commonsense knowledge for answering Winograd schema challenge. arXiv preprint arXiv:2005.05763.

