So far we have looked only at the producer side of the structured data world – those publishing structured data on their web pages. But how is this data queried or consumed by others? And from where? Knowledge vaults – huge, queryable databases that publish semantic data – are increasingly being developed by search engines and are one of the major consumers of structured data.
After this tutorial, you should be able to:
- Understand what a knowledge graph and a knowledge vault are.
- Understand the knowledge graphs that are available, and who can access them.
- See some of the limitations of current knowledge graphs, and how new probabilistic knowledge bases can overcome them.
- Understand, by example, how to query the Google Knowledge Graph.
Estimated time: 5 minutes
You should have already understood the following lesson (and pre-requisites) before you begin:
- Tutorial 3: Introduction To JSON-LD
Knowledge graphs and knowledge vaults are a new technology currently being developed by the major search engines.
Knowledge vaults are typically large-scale databases that express their data using a shared semantic vocabulary, such as the schema.org vocabulary. Some are open for all to use and can be queried using an API.
Their main goal is to store millions of entities (we saw some examples of entities earlier, such as books, events, and articles from the schema.org vocabulary) together with reliable facts about those entities. Some of the more prominent players in development include:
- Google Knowledge Graph (and the up-and-coming Knowledge Vault)
- Bing’s Satori (from Microsoft)
- Freebase (now redundant after being merged into the Google Knowledge Graph)
Google are currently one of the leading players in this area of technology, and so for the purposes of this article we will focus on Google’s Knowledge Graph and its successor, the new Knowledge Vault.
Note: The term ‘knowledge vault’ is at the moment mainly associated with the new Google Knowledge Vault. The term ‘knowledge graph’ is more general, and is often applied as an umbrella term for any large-scale database that stores facts and information using a semantic vocabulary (such as schema.org).
4.1 Google Knowledge Graph
Google Knowledge Graph was first announced in May 2012. Its sources of information include the former Freebase database (one of the first major databases to openly publish its data using a machine-readable semantic vocabulary) and also Wikidata (a free Wikimedia structured database project that is openly editable, rather like Wikipedia).
It is a large collection of entities (and facts associated with them) stored using schema.org vocabulary. Structured data found on web pages is also one of its sources.
Google Knowledge Graph can be queried using an openly available query API. It returns its results in JSON-LD format (as we saw in our previous tutorial). Therefore, any application that can understand JSON-LD and the widely-used schema.org vocabulary can harness and consume the huge, open knowledge base provided by Google.
Note: The great thing about harnessing the information in a knowledge graph is that, as it is an open, web-hosted database, it grows over time. Therefore, a query that you execute this year on a knowledge graph may return more, or more accurate, information next year.
Let’s introduce the Google Knowledge Graph by way of querying it using the Google Knowledge Graph API.
4.2 A First Query Of The Google Knowledge Graph
One of the simplest ways the knowledge graph can be queried is by using its HTTP/REST interface. Using the HTTP interface it is even possible to view the JSON-LD results in a web browser.
Note: As with most web-exposed databases, the Knowledge Graph is read-only. This is not surprising, as it would be almost impossible to preserve the integrity and security of the knowledge graph if anyone in the world could write to it.
One of the most common ways to query the knowledge graph is a simple search for a search term. To demonstrate, we search for the best match for the search term ‘magnolia’ (which is a type of flower):
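As a minimal sketch of how such a search request could be put together, the Python below builds a GET URL for the public Knowledge Graph Search API endpoint. The helper name `build_search_url` and the `YOUR_API_KEY` placeholder are assumptions made for illustration; a real key has to be obtained from Google before the URL will return results.

```python
from urllib.parse import urlencode

# Public endpoint of the Google Knowledge Graph Search API.
KG_SEARCH_ENDPOINT = "https://kgsearch.googleapis.com/v1/entities:search"


def build_search_url(term: str, api_key: str, limit: int = 1) -> str:
    """Return a GET URL asking for the best matches for a search term.

    `build_search_url` is a hypothetical helper, not part of any Google library.
    """
    params = {
        "query": term,     # the free-text search term
        "key": api_key,    # placeholder here; a real key is issued by Google
        "limit": limit,    # 1 keeps only the best match
        "indent": "true",  # pretty-print the JSON-LD so it reads well in a browser
    }
    return f"{KG_SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("magnolia", api_key="YOUR_API_KEY")
print(url)
```

Pasting the printed URL (with a valid key substituted) into a browser address bar should show the JSON-LD response directly.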
This query returns the following JSON-LD:
"articleBody": "Magnolia is a large genus of about 210 flowering plant species in the subfamily Magnolioideae of the family Magnoliaceae. It is named after French botanist Pierre Magnol.\nMagnolia is an ancient genus. ",
If you worked through our previous tutorial, this document should look familiar: it is a JSON-LD document. The context of the document shows that the vocabulary is http://schema.org (line 03). Because the meaning of this vocabulary is known, any software that we write to consume this data can interpret the data values returned by the query, i.e. it knows what they mean.
If software can understand what values mean, software can choose how to display the information, or how to process it with other information to get more out of the data.
As an exercise, have a brief look at the JSON-LD above and see if you can get a feel for what information the response contains.
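One way to get a feel for a response is to load it and list what it contains. The sketch below does this for a small, hand-written sample. The sample is an assumption shaped after the response layout Google documents for the API (`itemListElement`, `result`, `detailedDescription`), and its values are illustrative placeholders apart from the article text quoted above.

```python
import json

# Illustrative sample only: the layout follows the documented response shape,
# but the description, image URL and score are made-up placeholders.
SAMPLE_RESPONSE = """
{
  "@context": {
    "@vocab": "http://schema.org/",
    "goog": "http://schema.googleapis.com/",
    "EntitySearchResult": "goog:EntitySearchResult",
    "detailedDescription": "goog:detailedDescription",
    "resultScore": "goog:resultScore"
  },
  "@type": "ItemList",
  "itemListElement": [
    {
      "@type": "EntitySearchResult",
      "result": {
        "@type": ["Thing"],
        "name": "Magnolia",
        "description": "Plant",
        "image": {"contentUrl": "https://example.org/magnolia.jpg"},
        "detailedDescription": {
          "articleBody": "Magnolia is a large genus of about 210 flowering plant species."
        }
      },
      "resultScore": 1.0
    }
  ]
}
"""


def summarise(document: dict) -> list[str]:
    """List the default vocabulary and the property names of each result."""
    lines = [f"vocabulary: {document['@context']['@vocab']}"]
    for element in document.get("itemListElement", []):
        result = element["result"]
        lines.append(f"{result['name']}: {', '.join(sorted(result))}")
    return lines


summary = summarise(json.loads(SAMPLE_RESPONSE))
print("\n".join(summary))
```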
Important: Although the knowledge graph returns fully compliant JSON-LD, it does not return the links between entities that you would obtain from a full RDF graph.
Many properties of the magnolia are common to many kinds of entity, not just flowers: for example, the ‘description’ property (line 20) or the image ‘contentUrl’ property (line 22). Because schema.org offers a single vocabulary in which to publish these common properties, we can write software that knows what they refer to. That can save many hours of software development time, as we can start to standardize the way we display, or process, the same types of properties across thousands or millions of entities.
As we can understand what some of these properties mean, we could choose to use these properties to create our own display template from this returned entity. For example, we could choose to display the result by showing a small image thumbnail of the result, combined with the description and article snippet:
Magnolia is a large genus of about 210 flowering plant species in the subfamily Magnolioideae of the family Magnoliaceae. It is named after French botanist Pierre Magnol.
Magnolia is an ancient genus.
This simple example of a display template could be used to display a snippet of information about literally thousands or millions of entities defined in schema.org vocabulary.
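A display template of this kind could be sketched in a few lines of Python. The function `render_snippet` below is a hypothetical helper, and the nesting it reads (`image.contentUrl`, `detailedDescription.articleBody`) is an assumption based on the documented response layout. Missing properties are simply skipped, which is what lets the same template serve many different entities.

```python
from html import escape


def render_snippet(result: dict) -> str:
    """Render one entity as a small HTML snippet.

    Hypothetical template: thumbnail, short description, article snippet.
    """
    name = escape(result.get("name", ""))
    description = escape(result.get("description", ""))
    image_url = escape(result.get("image", {}).get("contentUrl", ""))
    article = escape(result.get("detailedDescription", {}).get("articleBody", ""))

    parts = [f"<h3>{name}</h3>"]
    if image_url:
        parts.append(f'<img src="{image_url}" alt="{name}" width="80">')
    if description:
        parts.append(f"<p><em>{description}</em></p>")
    if article:
        parts.append(f"<p>{article}</p>")
    return "\n".join(parts)


# Placeholder entity; only the article text is taken from the tutorial.
magnolia = {
    "name": "Magnolia",
    "description": "Plant",
    "image": {"contentUrl": "https://example.org/magnolia.jpg"},
    "detailedDescription": {
        "articleBody": "Magnolia is a large genus of about 210 flowering plant species."
    },
}
snippet = render_snippet(magnolia)
print(snippet)
```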
Important: Imagine if you developed an application that understood the schema.org vocabulary. You could take the results of your queries and automatically know how to display, or in some cases even process, the data returned. This is a key benefit of the semantic web: allowing machines, as well as humans, to understand data.
4.3 Looking Ahead: The Knowledge Vault
The Google Knowledge Graph is a large, aggregated semantic database expressed in schema.org vocabulary. However, one of its problems is that well-structured, trustworthy data is difficult to come by unless the masses of unstructured data on the web can also be examined.
For example, Google’s academic paper on the new Knowledge Vault highlights that over 70% of people in the well-administered Freebase database have no recorded place of birth or nationality. And, for less common properties (or predicates), the coverage is even worse.
And so, some of the current structured databases being built tend to include only ‘key facts’ information about relatively few, and very well known, entities.
This is why Google is currently working on the Knowledge Vault. With this new advancement to the knowledge graph, Google is attempting to automatically acquire facts from the mass of unstructured information still on the web.
The Knowledge Vault is an attempt to crawl the web for facts and properties and, by comparing them with the reliable structured data already available, to assign each automatically acquired fact a confidence level.
This is known as a probabilistic knowledge base, and it is the only way that Google believe a web-scale knowledge base can be practically built.
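To make the idea of a confidence level concrete, here is a toy sketch. It is not Google's method: it assumes each candidate fact has been seen by one or more independent extractors, each reporting its own confidence, and fuses them with a simple noisy-OR rule before keeping only facts above a threshold. The predicate `namedAfter`, the scores and the threshold are all invented for illustration.

```python
from math import prod


def noisy_or(confidences: list[float]) -> float:
    """Probability that at least one independent piece of evidence is right."""
    return 1.0 - prod(1.0 - c for c in confidences)


def confident_facts(candidates: dict, threshold: float = 0.8) -> dict:
    """Score every candidate triple and keep those at or above the threshold."""
    scored = {triple: noisy_or(scores) for triple, scores in candidates.items()}
    return {triple: score for triple, score in scored.items() if score >= threshold}


# Candidate (subject, predicate, object) triples with made-up extractor scores.
candidates = {
    ("Magnolia", "namedAfter", "Pierre Magnol"): [0.6, 0.7],  # seen by two extractors
    ("Magnolia", "namedAfter", "Paris"): [0.2],               # weak, single sighting
}

kept = confident_facts(candidates)
for triple, score in kept.items():
    print(triple, round(score, 2))
```

With these numbers only the first triple survives, because two moderately confident sightings combine to 0.88 while the single weak sighting stays at 0.2.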
At the time of writing, Google reports that the Knowledge Vault contains 1.6 billion candidate triples, covering about 1,100 types of entity. This will grow as the knowledge base continues to extract factual data from the web.
4.4 How Data Is Stored In The Knowledge Vault
Rather excitingly for the semantic web community, Google are choosing to store all the facts and knowledge hoovered up by the knowledge crawler as RDF triples.
Note: If you are not currently familiar with RDF and graph databases, we highly recommend looking through our acclaimed Semantic Web Primer tutorial series, which will equip you with everything you need to understand the basics of RDF and web ontologies.
Machines will be able to understand the data in this database by looking at the web ontology that Google is going to use to store the data.
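For readers new to the idea, an RDF triple is simply a subject, a predicate and an object. The sketch below models one in Python and writes it out in the N-Triples text format. The `Triple` class, the `to_ntriples` helper and the `http://example.org/` IRI are illustrative assumptions, and nothing here reflects how the Knowledge Vault actually lays out its data.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject and predicate are IRIs, the object may be a literal."""
    subject: str
    predicate: str
    obj: str
    object_is_literal: bool = False


def to_ntriples(triple: Triple) -> str:
    """Serialise a triple as a single N-Triples line."""
    if triple.object_is_literal:
        # Escape backslashes and double quotes inside string literals.
        text = triple.obj.replace("\\", "\\\\").replace('"', '\\"')
        obj = f'"{text}"'
    else:
        obj = f"<{triple.obj}>"
    return f"<{triple.subject}> <{triple.predicate}> {obj} ."


fact = Triple(
    subject="http://example.org/entity/magnolia",  # made-up identifier
    predicate="http://schema.org/name",
    obj="Magnolia",
    object_is_literal=True,
)
line = to_ntriples(fact)
print(line)
```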
Why is this new development so exciting? Because small and large organisations alike will be able to query and harness this knowledge base for themselves. Being able to rank the probability of a ‘fact’ being true and to gain new insights and facts using machines will provide as yet unknown, but potentially huge, benefits for businesses, governments, and third sector organisations alike.
We will be keeping a close eye on the Knowledge Vault as new press releases from Google on the subject are published.
You have completed this lesson. You should now understand the following:
- The basic concepts of how knowledge graphs are stores of structured data.
- That structured data from web pages is one source of structured data in knowledge graphs.
- That the Google Knowledge Graph returns the results of queries in JSON-LD format.
- What a probabilistic knowledge base is, and how the coming Google Knowledge Vault is one such knowledge base.
You should now be able to start the following tutorial:
- Tutorial 5: Rich Snippets