Import

     

Importing data is straightforward, but an optimized import strategy takes some planning on your end. Before starting this guide, keep the following in mind.

  • When importing, make sure that you max out all available CPUs on the Weaviate cluster. More often than not, the import script is the bottleneck rather than Weaviate.

    • Tip: run htop while importing to check whether all CPUs are maxed out.
    • Learn more about how to plan your setup here.
  • Use parallelization: if the CPUs are not maxed out, add another import process.

  • On Kubernetes, fewer large machines are faster than many small machines because of network latency.
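
One way to "add another import process" is to split the data into disjoint shards and give each process its own shard. The sketch below only illustrates that idea; `shard_round_robin` and the worker count are hypothetical names chosen for this example and are not part of any Weaviate client.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def shard_round_robin(items: Sequence[T], num_workers: int) -> List[List[T]]:
    """Split items into num_workers disjoint shards, one per import process."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Worker i gets items i, i + num_workers, i + 2 * num_workers, ...
    return [list(items[i::num_workers]) for i in range(num_workers)]


# Example: seven objects spread over three import processes.
shards = shard_round_robin(list(range(7)), 3)
print(shards)  # [[0, 3, 6], [1, 4], [2, 5]]
```

Each shard could then be handed to a separate process (for example via Python's `multiprocessing` module), each with its own client connection and batch. Watching htop while raising the worker count shows when the CPUs are saturated.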

Importing

First of all, some rules of thumb.

  • You should always use batch import.
  • As mentioned above, max out the CPUs on the Weaviate cluster. Often your import script is the bottleneck.
  • Check and handle the error messages returned for batch requests.
  • Some clients (especially the Python client) have built-in logic to regulate batch importing efficiently.
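
To make "process error messages" concrete, here is a minimal sketch of a function that pulls error messages out of a batch response. It assumes that each item in the response is a dict that may carry a `result.errors.error` list of `{"message": ...}` entries; check this shape against your Weaviate version. `collect_batch_errors` is a hypothetical helper name, and the sample response is hand-written.

```python
from typing import Any, Dict, List, Optional


def collect_batch_errors(results: Optional[List[Dict[str, Any]]]) -> List[str]:
    """Return all error messages found in a batch response (assumed shape)."""
    messages: List[str] = []
    for item in results or []:
        # Assumption: failures are reported under item["result"]["errors"]["error"].
        errors = (item.get("result") or {}).get("errors") or {}
        for error in errors.get("error", []):
            if isinstance(error, dict):
                messages.append(str(error.get("message", error)))
            else:
                messages.append(str(error))
    return messages


# Hand-written sample response: one object succeeded, one failed.
sample = [
    {"id": "a", "result": {}},
    {"id": "b", "result": {"errors": {"error": [{"message": "invalid vector length"}]}}},
]
print(collect_batch_errors(sample))  # ['invalid vector length']
```

In the Python client, a function along these lines could probably be adapted into the `callback` passed to `client.batch.configure`, logging or raising instead of returning the messages.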

Assuming that you’ve read the schema getting started guide, you import data based on the classes and properties defined in the schema.

For the purpose of this tutorial, we’ve prepared a data.json file, which contains a few Authors and Publications. Download it from here, and add it to your project.

Now, to import the data we need to follow these steps:

  1. Connect to your Weaviate instance
  2. Load objects from the data.json file
  3. Prepare a batch process
  4. Loop through all Publications
    • Parse each publication into the structure expected by the language client of your choice
    • Push the object through the batch process
  5. Loop through all Authors
    • Parse each author into the structure expected by the language client of your choice
    • Push the object through the batch process
  6. Flush the batch process in case there are any remaining objects in the buffer
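
To see why the final flush matters, consider this toy model of a batch process. It is not the Weaviate client: `ToyBatch` and its `send` function are hypothetical stand-ins that only mimic the buffering behavior described in the steps above.

```python
from typing import Any, Callable, Dict, List


class ToyBatch:
    """Toy stand-in for a client batch: buffers objects, sends them in groups."""

    def __init__(self, batch_size: int, send: Callable[[List[Dict[str, Any]]], None]):
        self.batch_size = batch_size
        self.send = send
        self.buffer: List[Dict[str, Any]] = []

    def add(self, obj: Dict[str, Any]) -> None:
        self.buffer.append(obj)
        # Send automatically once the buffer is full.
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Send whatever is left, even if the buffer is only partly full.
        if self.buffer:
            self.send(self.buffer)
            self.buffer = []


sent: List[List[Dict[str, Any]]] = []
batch = ToyBatch(batch_size=2, send=sent.append)
for name in ["A", "B", "C"]:
    batch.add({"name": name})
print(len(sent), len(batch.buffer))  # 1 1  (object "C" is still buffered)
batch.flush()
print(len(sent), len(batch.buffer))  # 2 0
```

Without the last `flush()` call, the third object would never be sent, because the buffer never reached the batch size again.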

Here is the full code you need to import the Publications (the Authors example that follows is shorter because it reuses the same client, data, and batch configuration).

import json

import weaviate

client = weaviate.Client("https://some-endpoint.semi.network/")

# Load data from the data.json file (the file is closed automatically)
with open("data.json") as data_file:
    data = json.load(data_file)

# Configure a batch process
client.batch.configure(
    batch_size=100,
    dynamic=True,
    timeout_retries=3,
    callback=None,
)

# Batch import all Publications
for publication in data["publications"]:
    print("importing publication: ", publication["name"])

    properties = {
        "name": publication["name"],
    }

    # Arguments: properties, class name, object id, object vector
    client.batch.add_data_object(properties, "Publication", publication["id"], publication["vector"])

# Flush the remaining buffer to make sure all objects are imported
client.batch.flush()

And here is the code to import Authors.

# Batch import all Authors
# (reuses the client, data, and batch configuration from the Publications example)
for author in data["authors"]:
    print("importing author: ", author["name"])

    properties = {
        "name": author["name"],
        "age": author["age"],
        "born": author["born"],
        "wonNobelPrize": author["wonNobelPrize"],
        "description": author["description"],
    }

    # Arguments: properties, class name, object id, object vector
    client.batch.add_data_object(properties, "Author", author["id"], author["vector"])

# Flush the remaining buffer to make sure all objects are imported
client.batch.flush()

You can quickly check the imported objects by opening weaviate-endpoint/v1/objects in a browser, like this:

https://some-endpoint.semi.network/v1/objects

Or you can read the objects in your project, like this:

import weaviate
import json

client = weaviate.Client("https://some-endpoint.semi.network/")

all_objects = client.data_object.get()
print(json.dumps(all_objects))
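
As a quick sanity check after an import, you could count the returned objects per class. The sketch below assumes the response is a dict with an `"objects"` list whose items each carry a `"class"` key; `count_by_class` is a hypothetical helper, and the sample response is hand-written rather than real output.

```python
from collections import Counter
from typing import Any, Dict


def count_by_class(response: Dict[str, Any]) -> Dict[str, int]:
    """Count returned objects per class name (response shape is assumed)."""
    # Assumption: objects are listed under "objects", each with a "class" key.
    objects = response.get("objects") or []
    return dict(Counter(obj.get("class", "<unknown>") for obj in objects))


# Hand-written sample in the assumed shape of the /v1/objects response.
sample_response = {
    "objects": [
        {"class": "Publication", "properties": {"name": "Example Times"}},
        {"class": "Author", "properties": {"name": "Jane Doe"}},
        {"class": "Author", "properties": {"name": "John Doe"}},
    ]
}
print(count_by_class(sample_response))  # {'Publication': 1, 'Author': 2}
```

With a live instance you would pass the result of `client.data_object.get()` instead of `sample_response`. The endpoint may return only a limited page of objects, so treat the counts as a spot check rather than a full total.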

Other object operations

All other CRUD object operations are described in the objects RESTful API documentation and the batch RESTful API documentation.

Recapitulation

Importing into Weaviate needs some planning on your side. In almost all cases, you want to use the batch endpoint to create data objects. More often than not, the bottleneck sits in the import script and not in Weaviate. Try to optimize for maxing out all CPUs to get the fastest import speeds.
