Schema

     

You’ve made it to the schema getting started guide! The schema is where you set data types, cross-references, and more, and where you tweak index settings (ANN, inverted index, BM25).

This is also a hands-on guide! Oh, and it is a bit longer than the others 😉

Prerequisites

At this point, you should have Weaviate running either in the WCS or locally (for example, with Docker-compose).

Client Libraries

You can communicate with Weaviate from your code by using one of the available client libraries (currently available for Python, JavaScript, Java, and Go) or the RESTful API.

The first order of business is to add the client library to your project.

  • For Python add the weaviate-client to your system libraries with pip:
pip install weaviate-client
  • For JavaScript add weaviate-client to your project with npm:
npm install weaviate-client
  • For Java add this dependency to your project:
<dependency>
<groupId>technology.semi.weaviate</groupId>
<artifactId>client</artifactId>
<version>3.2.0</version>
</dependency>
  • For Go add weaviate-go-client to your project with go get:
go get github.com/semi-technologies/weaviate-go-client/v4

Connect to Weaviate

First, let’s make sure that you can connect to your Weaviate instance. To do this, we need the host endpoint of your instance:

  • If you use the WCS, the endpoint is based on the cluster-id you created in the previous lesson. Just replace some-endpoint in the code example below with your cluster-id.
  • If you are running Weaviate locally, the endpoint is localhost:8080.

import weaviate
import json

# pick ONE of the two lines below and comment out the other
client = weaviate.Client("https://some-endpoint.semi.network/") # <== if you use the WCS
# client = weaviate.Client("http://localhost:8080") # <== if you use Docker-compose

schema = client.schema.get()
print(json.dumps(schema))

The result should look like this:

{"classes": []}

This means you’re connected to an empty Weaviate.
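If you would rather check the result in code than by eye, you can inspect the dict returned by client.schema.get() directly. The snippet below is a minimal sketch: summarize_schema is a hypothetical helper name (not part of the client library), and the schema value is a hard-coded stand-in for a real response.

```python
def summarize_schema(schema):
    """Return the class names found in a schema dict (empty list if none)."""
    return [cls["class"] for cls in schema.get("classes", [])]


# Stand-in for the value returned by client.schema.get()
schema = {"classes": []}

names = summarize_schema(schema)
if names:
    print("Existing classes:", ", ".join(names))
else:
    print("Connected to an empty Weaviate.")
```

With a live instance, you would pass the result of client.schema.get() instead of the stand-in dict.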

info

From now on, all examples will provide the code using the WCS endpoint:

"https://some-endpoint.semi.network/"

Replace the value to match your host endpoint.

Resetting your Weaviate instance

If the result is not an empty schema and you see (old) classes, you can restart your instance or, if you’re using the Python client, run the following. Note that deleting a class also deletes the data objects stored in it.

import weaviate
import json

client = weaviate.Client("https://some-endpoint.semi.network/")

# delete all classes
client.schema.delete_all()

schema = client.schema.get()
print(json.dumps(schema))

Create your first class!

Let’s create your first class!

We’ll take the example of the Author from the basics guide.

Our Authors have the following properties:

  • name: type string
  • age: type int
  • born: type date
  • wonNobelPrize: type boolean
  • description: type text
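The list above maps one-to-one onto the "properties" array of a class definition. As a rough sketch, the same definition could be assembled from a compact list of tuples. build_class is a hypothetical helper, not part of the client library, and the descriptions are the ones used in this guide.

```python
def build_class(name, description, fields):
    """Build a class definition dict from (property, data type, description) tuples."""
    return {
        "class": name,
        "description": description,
        "properties": [
            {"dataType": [data_type], "description": text, "name": prop}
            for prop, data_type, text in fields
        ],
    }


author_class = build_class(
    "Author",  # class names start with a capital letter
    "A description of this class, in this case, it is about authors",
    [
        ("name", "string", "The name of the Author"),
        ("age", "int", "The age of the Author"),
        ("born", "date", "The date of birth of the Author"),
        ("wonNobelPrize", "boolean", "A boolean value if the Author won a nobel prize"),
        ("description", "text", "A description of the author"),
    ],
)

print([prop["name"] for prop in author_class["properties"]])
```

The resulting dict has the same shape as the hand-written class_obj used in this guide, so it could be passed to client.schema.create_class in the same way.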

Run the below code in your application, which will define the schema for the Author class.

import weaviate
import json

client = weaviate.Client("https://some-endpoint.semi.network/") # <== update the endpoint here!

# we will create the class "Author" and the properties
# from the basics section of this guide
class_obj = {
    "class": "Author", # <= note the capital "A".
    "description": "A description of this class, in this case, it is about authors",
    "properties": [
        {
            "dataType": [
                "string"
            ],
            "description": "The name of the Author",
            "name": "name",
        },
        {
            "dataType": [
                "int"
            ],
            "description": "The age of the Author",
            "name": "age"
        },
        {
            "dataType": [
                "date"
            ],
            "description": "The date of birth of the Author",
            "name": "born"
        },
        {
            "dataType": [
                "boolean"
            ],
            "description": "A boolean value if the Author won a nobel prize",
            "name": "wonNobelPrize"
        },
        {
            "dataType": [
                "text"
            ],
            "description": "A description of the author",
            "name": "description"
        }
    ]
}

# add the schema
client.schema.create_class(class_obj)

# get the schema
schema = client.schema.get()

# print the schema
print(json.dumps(schema, indent=4))

The result should look something like this:

{
    "classes": [
        {
            "class": "Author",
            "description": "A description of this class, in this case, it is about authors",
            "invertedIndexConfig": {
                "bm25": {
                    "b": 0.75,
                    "k1": 1.2
                },
                "cleanupIntervalSeconds": 60,
                "stopwords": {
                    "additions": null,
                    "preset": "en",
                    "removals": null
                }
            },
            "properties": [
                {
                    "dataType": [
                        "string"
                    ],
                    "description": "The name of the Author",
                    "name": "name",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "int"
                    ],
                    "description": "The age of the Author",
                    "name": "age"
                },
                {
                    "dataType": [
                        "date"
                    ],
                    "description": "The date of birth of the Author",
                    "name": "born"
                },
                {
                    "dataType": [
                        "boolean"
                    ],
                    "description": "A boolean value if the Author won a nobel prize",
                    "name": "wonNobelPrize"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "description": "A description of the author",
                    "name": "description",
                    "tokenization": "word"
                }
            ],
            "shardingConfig": {
                "virtualPerPhysical": 128,
                "desiredCount": 1,
                "actualCount": 1,
                "desiredVirtualCount": 128,
                "actualVirtualCount": 128,
                "key": "_id",
                "strategy": "hash",
                "function": "murmur3"
            },
            "vectorIndexConfig": {
                "skip": false,
                "cleanupIntervalSeconds": 300,
                "maxConnections": 64,
                "efConstruction": 128,
                "ef": -1,
                "dynamicEfMin": 100,
                "dynamicEfMax": 500,
                "dynamicEfFactor": 8,
                "vectorCacheMaxObjects": 2000000,
                "flatSearchCutoff": 40000,
                "distance": "cosine"
            },
            "vectorIndexType": "hnsw",
            "vectorizer": "none"
        }
    ]
}

Wow, that’s a lot more than we added!

Correct, that’s Weaviate adding some default config for you. You can change, improve, tweak, and update this, but that’s for a later expert guide.
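To see exactly which settings were filled in for you, you can compare what you sent with what came back. The snippet below is a small sketch of that comparison. added_by_weaviate is a hypothetical helper name, and both dicts are abbreviated stand-ins (the nested values are elided) rather than real responses.

```python
def added_by_weaviate(submitted, returned):
    """List the top-level keys present in the returned class but not in the submitted one."""
    return sorted(key for key in returned if key not in submitted)


# What we sent (abbreviated)
submitted = {"class": "Author", "description": "...", "properties": []}

# What came back (abbreviated stand-in; nested values elided as empty dicts)
returned = {
    "class": "Author",
    "description": "...",
    "invertedIndexConfig": {},
    "properties": [],
    "shardingConfig": {},
    "vectorIndexConfig": {},
    "vectorIndexType": "hnsw",
    "vectorizer": "none",
}

for key in added_by_weaviate(submitted, returned):
    print(key)
```

The same comparison can be applied per property, which would surface the "tokenization" key that appears on the string and text properties in the result above.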

Now, let’s add a second class called Publication. We will use it to store info about publication outlets like The New York Times or The Guardian.

Our Publication will contain one property:

  • name: type string

Run the below code in your application.

import weaviate
import json

client = weaviate.Client("https://some-endpoint.semi.network/")

# we will create the class "Publication" and the properties
# from the basics section of this guide
class_obj = {
    "class": "Publication",
    "description": "A description of this class, in this case, it is about publications",
    "properties": [
        {
            "dataType": [
                "string"
            ],
            "description": "The name of the Publication",
            "name": "name",
        }
    ]
}

# add the schema
client.schema.create_class(class_obj)

# get the schema
schema = client.schema.get()

# print the schema
print(json.dumps(schema, indent=4))

The result should look something like this:

{
    "classes": [
        {
            "class": "Author",
            "description": "A description of this class, in this case, it is about authors",
            "invertedIndexConfig": {
                "bm25": {
                    "b": 0.75,
                    "k1": 1.2
                },
                "cleanupIntervalSeconds": 60,
                "stopwords": {
                    "additions": null,
                    "preset": "en",
                    "removals": null
                }
            },
            "properties": [
                {
                    "dataType": [
                        "string"
                    ],
                    "description": "The name of the Author",
                    "name": "name",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "int"
                    ],
                    "description": "The age of the Author",
                    "name": "age"
                },
                {
                    "dataType": [
                        "date"
                    ],
                    "description": "The date of birth of the Author",
                    "name": "born"
                },
                {
                    "dataType": [
                        "boolean"
                    ],
                    "description": "A boolean value if the Author won a nobel prize",
                    "name": "wonNobelPrize"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "description": "A description of the author",
                    "name": "description",
                    "tokenization": "word"
                }
            ],
            "shardingConfig": {
                "virtualPerPhysical": 128,
                "desiredCount": 1,
                "actualCount": 1,
                "desiredVirtualCount": 128,
                "actualVirtualCount": 128,
                "key": "_id",
                "strategy": "hash",
                "function": "murmur3"
            },
            "vectorIndexConfig": {
                "skip": false,
                "cleanupIntervalSeconds": 300,
                "maxConnections": 64,
                "efConstruction": 128,
                "ef": -1,
                "dynamicEfMin": 100,
                "dynamicEfMax": 500,
                "dynamicEfFactor": 8,
                "vectorCacheMaxObjects": 2000000,
                "flatSearchCutoff": 40000,
                "distance": "cosine"
            },
            "vectorIndexType": "hnsw",
            "vectorizer": "none"
        },
        {
            "class": "Publication",
            "description": "A description of this class, in this case, it is about publications",
            "invertedIndexConfig": {
                "bm25": {
                    "b": 0.75,
                    "k1": 1.2
                },
                "cleanupIntervalSeconds": 60,
                "stopwords": {
                    "additions": null,
                    "preset": "en",
                    "removals": null
                }
            },
            "properties": [
                {
                    "dataType": [
                        "string"
                    ],
                    "description": "The name of the Publication",
                    "name": "name",
                    "tokenization": "word"
                }
            ],
            "shardingConfig": {
                "virtualPerPhysical": 128,
                "desiredCount": 1,
                "actualCount": 1,
                "desiredVirtualCount": 128,
                "actualVirtualCount": 128,
                "key": "_id",
                "strategy": "hash",
                "function": "murmur3"
            },
            "vectorIndexConfig": {
                "skip": false,
                "cleanupIntervalSeconds": 300,
                "maxConnections": 64,
                "efConstruction": 128,
                "ef": -1,
                "dynamicEfMin": 100,
                "dynamicEfMax": 500,
                "dynamicEfFactor": 8,
                "vectorCacheMaxObjects": 2000000,
                "flatSearchCutoff": 40000,
                "distance": "cosine"
            },
            "vectorIndexType": "hnsw",
            "vectorizer": "none"
        }
    ]
}

Note that the schema now contains both the Author and the Publication class!

info

Auto schema feature

You can import data into Weaviate without creating a schema. Weaviate will use all default settings and guess which data types you use. If you have a setup with modules, Weaviate will also guess the default settings for the modules.

Although auto schema works well in some cases, we always advise setting your schema manually to optimize Weaviate’s performance.
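One practical upside of a manual schema is that you know the expected types up front and can check objects before importing them. The snippet below is a loose sketch of such a check, not something Weaviate or the client library provides. find_type_mismatches is a hypothetical helper, the sample object is made up, and only the five data types used in this guide are covered. The date check accepts any ISO-style timestamp, which is more lenient than a strict RFC 3339 check.

```python
from datetime import datetime


def _is_date(value):
    """Loose check: a string that Python can parse as an ISO-style timestamp."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value.replace("Z", "+00:00"))
        return True
    except ValueError:
        return False


# Simplified checks for the five data types used in this guide
CHECKS = {
    "string": lambda v: isinstance(v, str),
    "text": lambda v: isinstance(v, str),
    "int": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "boolean": lambda v: isinstance(v, bool),
    "date": _is_date,
}


def find_type_mismatches(obj, properties):
    """Return the field names in obj that are unknown or do not match the schema type."""
    types = {prop["name"]: prop["dataType"][0] for prop in properties}
    mismatches = []
    for name, value in obj.items():
        check = CHECKS.get(types.get(name))
        if check is None or not check(value):
            mismatches.append(name)
    return mismatches


author_properties = [
    {"name": "name", "dataType": ["string"]},
    {"name": "age", "dataType": ["int"]},
    {"name": "born", "dataType": ["date"]},
    {"name": "wonNobelPrize", "dataType": ["boolean"]},
    {"name": "description", "dataType": ["text"]},
]

# Made-up sample object; "age" is deliberately a string instead of an int
candidate = {
    "name": "Jane Doe",
    "age": "42",
    "born": "1950-01-01T00:00:00Z",
    "wonNobelPrize": False,
}

print(find_type_mismatches(candidate, author_properties))
```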

Setting cross-references

Now that we have these two classes, we can use a cross-reference to indicate that an Author writesFor a Publication. To achieve this, we update the Author class to contain the cross-reference to Publication.

Run the below code in your application to update the Author class with the writesFor cross-reference to Publication.

import weaviate
import json

client = weaviate.Client("https://some-endpoint.semi.network/")

add_prop = {
    "dataType": [
        "Publication" # <== note how the name of the class is the cross reference
    ],
    "name": "writesFor"
}

# Add the property
client.schema.property.create("Author", add_prop)

# get the schema
schema = client.schema.get()

# print the schema
print(json.dumps(schema, indent=4))

The result should look something like this:

{
    "classes": [
        {
            "class": "Author",
            "description": "A description of this class, in this case, it is about authors",
            "invertedIndexConfig": {
                "bm25": {
                    "b": 0.75,
                    "k1": 1.2
                },
                "cleanupIntervalSeconds": 60,
                "stopwords": {
                    "additions": null,
                    "preset": "en",
                    "removals": null
                }
            },
            "properties": [
                {
                    "dataType": [
                        "string"
                    ],
                    "description": "The name of the Author",
                    "name": "name",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "int"
                    ],
                    "description": "The age of the Author",
                    "name": "age"
                },
                {
                    "dataType": [
                        "date"
                    ],
                    "description": "The date of birth of the Author",
                    "name": "born"
                },
                {
                    "dataType": [
                        "boolean"
                    ],
                    "description": "A boolean value if the Author won a nobel prize",
                    "name": "wonNobelPrize"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "description": "A description of the author",
                    "name": "description",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "Publication"
                    ],
                    "name": "writesFor"
                }
            ],
            "shardingConfig": {
                "virtualPerPhysical": 128,
                "desiredCount": 1,
                "actualCount": 1,
                "desiredVirtualCount": 128,
                "actualVirtualCount": 128,
                "key": "_id",
                "strategy": "hash",
                "function": "murmur3"
            },
            "vectorIndexConfig": {
                "skip": false,
                "cleanupIntervalSeconds": 300,
                "maxConnections": 64,
                "efConstruction": 128,
                "ef": -1,
                "dynamicEfMin": 100,
                "dynamicEfMax": 500,
                "dynamicEfFactor": 8,
                "vectorCacheMaxObjects": 2000000,
                "flatSearchCutoff": 40000,
                "distance": "cosine"
            },
            "vectorIndexType": "hnsw",
            "vectorizer": "none"
        },
        {
            "class": "Publication",
            "description": "A description of this class, in this case, it is about publications",
            "invertedIndexConfig": {
                "bm25": {
                    "b": 0.75,
                    "k1": 1.2
                },
                "cleanupIntervalSeconds": 60,
                "stopwords": {
                    "additions": null,
                    "preset": "en",
                    "removals": null
                }
            },
            "properties": [
                {
                    "dataType": [
                        "string"
                    ],
                    "description": "The name of the Publication",
                    "name": "name",
                    "tokenization": "word"
                }
            ],
            "shardingConfig": {
                "virtualPerPhysical": 128,
                "desiredCount": 1,
                "actualCount": 1,
                "desiredVirtualCount": 128,
                "actualVirtualCount": 128,
                "key": "_id",
                "strategy": "hash",
                "function": "murmur3"
            },
            "vectorIndexConfig": {
                "skip": false,
                "cleanupIntervalSeconds": 300,
                "maxConnections": 64,
                "efConstruction": 128,
                "ef": -1,
                "dynamicEfMin": 100,
                "dynamicEfMax": 500,
                "dynamicEfFactor": 8,
                "vectorCacheMaxObjects": 2000000,
                "flatSearchCutoff": 40000,
                "distance": "cosine"
            },
            "vectorIndexType": "hnsw",
            "vectorizer": "none"
        }
    ]
}

Note this part (this is just a chunk of the response):

{
    "classes": [
        {
            "class": "Author",
            "properties": [
                {
                    "dataType": [
                        "Publication"
                    ],
                    "name": "writesFor"
                }
            ]
        }
    ]
}

We can also set it the other way around: a Publication has Authors. To achieve this, we update the Publication class to contain the has cross-reference to Author.

import weaviate
import json

client = weaviate.Client("https://some-endpoint.semi.network/")

add_prop = {
    "dataType": [
        "Author" # <== note how the name of the class is the cross reference
    ],
    "name": "has"
}

# Add the property
client.schema.property.create("Publication", add_prop)

# get the schema
schema = client.schema.get()

# print the schema
print(json.dumps(schema, indent=4))

This results in:

{
    "classes": [
        {
            "class": "Author",
            "description": "A description of this class, in this case, it is about authors",
            "invertedIndexConfig": {
                "bm25": {
                    "b": 0.75,
                    "k1": 1.2
                },
                "cleanupIntervalSeconds": 60,
                "stopwords": {
                    "additions": null,
                    "preset": "en",
                    "removals": null
                }
            },
            "properties": [
                {
                    "dataType": [
                        "string"
                    ],
                    "description": "The name of the Author",
                    "name": "name",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "int"
                    ],
                    "description": "The age of the Author",
                    "name": "age"
                },
                {
                    "dataType": [
                        "date"
                    ],
                    "description": "The date of birth of the Author",
                    "name": "born"
                },
                {
                    "dataType": [
                        "boolean"
                    ],
                    "description": "A boolean value if the Author won a nobel prize",
                    "name": "wonNobelPrize"
                },
                {
                    "dataType": [
                        "text"
                    ],
                    "description": "A description of the author",
                    "name": "description",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "Publication"
                    ],
                    "name": "writesFor"
                }
            ],
            "shardingConfig": {
                "virtualPerPhysical": 128,
                "desiredCount": 1,
                "actualCount": 1,
                "desiredVirtualCount": 128,
                "actualVirtualCount": 128,
                "key": "_id",
                "strategy": "hash",
                "function": "murmur3"
            },
            "vectorIndexConfig": {
                "skip": false,
                "cleanupIntervalSeconds": 300,
                "maxConnections": 64,
                "efConstruction": 128,
                "ef": -1,
                "dynamicEfMin": 100,
                "dynamicEfMax": 500,
                "dynamicEfFactor": 8,
                "vectorCacheMaxObjects": 2000000,
                "flatSearchCutoff": 40000,
                "distance": "cosine"
            },
            "vectorIndexType": "hnsw",
            "vectorizer": "none"
        },
        {
            "class": "Publication",
            "description": "A description of this class, in this case, it is about publications",
            "invertedIndexConfig": {
                "bm25": {
                    "b": 0.75,
                    "k1": 1.2
                },
                "cleanupIntervalSeconds": 60,
                "stopwords": {
                    "additions": null,
                    "preset": "en",
                    "removals": null
                }
            },
            "properties": [
                {
                    "dataType": [
                        "string"
                    ],
                    "description": "The name of the Publication",
                    "name": "name",
                    "tokenization": "word"
                },
                {
                    "dataType": [
                        "Author"
                    ],
                    "name": "has"
                }
            ],
            "shardingConfig": {
                "virtualPerPhysical": 128,
                "desiredCount": 1,
                "actualCount": 1,
                "desiredVirtualCount": 128,
                "actualVirtualCount": 128,
                "key": "_id",
                "strategy": "hash",
                "function": "murmur3"
            },
            "vectorIndexConfig": {
                "skip": false,
                "cleanupIntervalSeconds": 300,
                "maxConnections": 64,
                "efConstruction": 128,
                "ef": -1,
                "dynamicEfMin": 100,
                "dynamicEfMax": 500,
                "dynamicEfFactor": 8,
                "vectorCacheMaxObjects": 2000000,
                "flatSearchCutoff": 40000,
                "distance": "cosine"
            },
            "vectorIndexType": "hnsw",
            "vectorizer": "none"
        }
    ]
}

Note this part (this is just a chunk of the response):

{
    "classes": [
        {
            "class": "Author",
            "properties": [
                {
                    "dataType": [
                        "Publication"
                    ],
                    "name": "writesFor"
                }
            ]
        },
        {
            "class": "Publication",
            "properties": [
                {
                    "dataType": [
                        "Author"
                    ],
                    "name": "has"
                }
            ]
        }
    ]
}

info

You can set cross-references in any direction and later (as we will see while querying) filter on them. Please be aware that Weaviate is not a graph database (remember?). This means that dealing with, for example, many-to-many relationships or things like shortest-path algorithms is not in our wheelhouse.
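To get an overview of the cross-references in a schema, you can walk the dict returned by client.schema.get(). The sketch below assumes that a cross-reference data type is a class name and therefore starts with a capital letter, while primitive types such as string or int start with a lowercase letter. cross_references is a hypothetical helper name, and the schema value is the abbreviated chunk shown above.

```python
def cross_references(schema):
    """Return (from_class, property_name, to_class) triples for every cross-reference."""
    refs = []
    for cls in schema.get("classes", []):
        for prop in cls.get("properties", []):
            for data_type in prop["dataType"]:
                # Assumption: capitalized data types are class names (cross-references)
                if data_type[:1].isupper():
                    refs.append((cls["class"], prop["name"], data_type))
    return refs


# Abbreviated stand-in for the value returned by client.schema.get()
schema = {
    "classes": [
        {
            "class": "Author",
            "properties": [
                {"dataType": ["string"], "name": "name"},
                {"dataType": ["Publication"], "name": "writesFor"},
            ],
        },
        {
            "class": "Publication",
            "properties": [
                {"dataType": ["string"], "name": "name"},
                {"dataType": ["Author"], "name": "has"},
            ],
        },
    ]
}

for source, prop, target in cross_references(schema):
    print(f"{source} --{prop}--> {target}")
```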

Other schema operations

All schema operations are available in the API documentation for the schema endpoint. The documentation also includes examples in different client languages.
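If you want to talk to the schema endpoint without a client library, the sketch below shows how the request URLs could be put together. It assumes the schema lives under /v1/schema, with a class at /v1/schema/{ClassName} and its properties at /v1/schema/{ClassName}/properties. Please check the API documentation for your Weaviate version before relying on this layout. schema_url and fetch_schema are hypothetical helper names, and fetch_schema is defined but not called because it needs a reachable instance.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def schema_url(host, class_name=None, properties=False):
    """Build the URL of a schema resource on a Weaviate host (assumed path layout)."""
    url = host.rstrip("/") + "/v1/schema"
    if class_name is not None:
        url += "/" + quote(class_name, safe="")
        if properties:
            url += "/properties"
    return url


def fetch_schema(host):
    """GET the full schema as a dict (requires a reachable Weaviate instance)."""
    with urlopen(schema_url(host)) as response:
        return json.load(response)


print(schema_url("http://localhost:8080"))
print(schema_url("http://localhost:8080", "Author"))
print(schema_url("http://localhost:8080", "Author", properties=True))
```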

Recapitulation

  • Weaviate has a schema in which you define how your data objects will be indexed.
  • Weaviate’s schema is based on classes and properties.
  • The schema is highly configurable but comes with pre-defined settings.
  • There is an auto schema function, but for optimal usage, it’s better to create a schema manually.
