Query data using GraphQL and Python SDK

GraphQL Search API

CluedIn provides a powerful GraphQL API that is used in almost every interaction with CluedIn (one of the few exceptions is ingestion endpoints, which are not GraphQL). You can read more about CluedIn GraphQL in a separate article: documentation.cluedin.net/consume/graphql.

When you open the Consume section in your CluedIn instance, you will find a GraphQL playground where you can run GraphQL queries.

In our example, we have some /Duck entities (from DuckTales). To find them, you can run a query like this:

Simple GraphQL search query

{
  search(query:"+entityType:/Duck")
  {
    entries {
      id
      name
      entityType
    }
  }
}

The query returns the first 20 /Duck entities, because the default page size is 20. The query parameter tells the API to filter the response by a given business domain (previously called entity type). You can also specify the entity properties you want in the payload; here they are id, name, and entityType.
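
Outside the playground, GraphQL queries are conventionally sent as a JSON document in an HTTP POST body, with the query text under a `query` key and optional variables under a `variables` key. The sketch below only builds that body; `build_payload` is a hypothetical helper, and the endpoint URL and authentication headers are deliberately left out because they depend on your instance.

```python
import json

# GraphQL query text, same as the playground example.
QUERY = """
{
  search(query: "+entityType:/Duck") {
    entries {
      id
      name
      entityType
    }
  }
}
"""


def build_payload(query, variables=None):
    """Return the JSON request body for a GraphQL-over-HTTP POST (hypothetical helper)."""
    body = {"query": query}
    if variables is not None:
        body["variables"] = variables
    return json.dumps(body)


payload = build_payload(QUERY)
print(payload)
```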

GraphQL search query with variables and cursor

We can extend the query a little so that it takes its parameters from GraphQL variables:

query ($query: String, $pageSize: Int) {
  search(
    query: $query
    pageSize: $pageSize
    sort: FIELDS
    sortFields: {field: "id", direction: ASCENDING}
  ) {
    cursor
    entries {
      id
      name
      entityType
    }
  }
}

Variables:

{
  "query": "+entityType:/Duck",
  "pageSize": 10000
}

Here are a few things to notice in this query:

query ($query: String, $pageSize: Int) - we defined a query with parameters. We can also give this query a name: query searchEntities($query: String, $pageSize: Int)
sort: FIELDS sortFields: {field: "id", direction: ASCENDING} - it’s important to sort by a unique field to get predictable results when you page data.
cursor - we ask CluedIn to return a special value that we can pass to our next query to get the next page of results.
"pageSize": 10000 - By default, the page size is 20, so if you have millions of entities, you will get only the first 20. Setting the page size to its maximum value (10000) will decrease the number of requests you send to CluedIn API, but there are situations when you only want to get a few entities from the top, and a query with a smaller page size will be faster.

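To make the page-size trade-off concrete, here is a small arithmetic sketch. The total of one million entities is a made-up figure for illustration, and `requests_needed` is a hypothetical helper, not part of any CluedIn API.

```python
import math


def requests_needed(total_entities, page_size):
    """Number of page requests needed to read every entity once."""
    return math.ceil(total_entities / page_size)


total = 1_000_000  # hypothetical number of matching entities

print(requests_needed(total, 20))      # default page size: 50000 requests
print(requests_needed(total, 10_000))  # maximum page size: 100 requests
```

If you only need the first handful of entities, the calculation flips: one small page is enough, and a 10000-entity page would transfer far more data than you use.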

CluedIn Python SDK

Installation and initialization

You can use any programming language to send a GraphQL request to CluedIn and get the data back. Let’s explore how you can do it in Python.

First of all, you will need to install the latest version of CluedIn Python SDK:

%pip install cluedin

Next, import it together with pandas (we will use pandas to load data into DataFrames):

import pandas as pd
import cluedin

You need an API token; you can copy an existing one or create a new one by going to “API Tokens” under “Administration” in CluedIn.

Now, in our example, CluedIn is installed at https://foobar.contoso.com/, so we need to initialize a Context for our CluedIn instance by providing org_name (foobar), domain (contoso.com), and the access_token (the one you copied from CluedIn UI):

ctx = cluedin.Context.from_dict({
    'domain': 'contoso.com',
    'org_name': 'foobar',
    'access_token': '{paste_your_token_here}'
})
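
If you prefer not to hard-code these values, the following sketch derives them from the instance URL and reads the token from the environment. It assumes the organization name is the first label of the host name, as in the foobar.contoso.com example; `context_settings` and the `CLUEDIN_TOKEN` variable name are hypothetical and not part of the SDK.

```python
import os
from urllib.parse import urlsplit


def context_settings(instance_url, access_token):
    """Split an instance URL into org_name and domain (hypothetical helper)."""
    host = urlsplit(instance_url).hostname      # e.g. 'foobar.contoso.com'
    org_name, _, domain = host.partition(".")   # first label, then the rest
    return {"domain": domain, "org_name": org_name, "access_token": access_token}


# CLUEDIN_TOKEN is an assumed environment variable name; use whatever fits your setup.
token = os.environ.get("CLUEDIN_TOKEN", "{paste_your_token_here}")
settings = context_settings("https://foobar.contoso.com/", token)
print(settings["org_name"], settings["domain"])
```

The resulting dictionary has the same three keys as the one passed to cluedin.Context.from_dict.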

Running a GraphQL query

We can now run GraphQL queries from Python code:

query = """
query searchEntities($query: String, $pageSize: Int) {
  search(
    query: $query
    pageSize: $pageSize
    sort: FIELDS,
    sortFields: {field: "id", direction: ASCENDING}
  ) {
    cursor
    entries {
      id
      name
      entityType
    }
  }
}
"""

variables = {
    'query': '+entityType:/Duck',
    'pageSize': 3
}

cluedin.gql.gql(ctx, query=query, variables=variables)

The result is the top three entities (because we use a page size of 3 for demonstration purposes), and we also get a cursor that we can use to get the next page:

{'data': {'search': {'cursor': 'ewAiAFAAYQBnAGUAIgA6ADEALAAiAFAAYQBnAGUAUwBpAHoAZQAiADoAMwAsACIAQwBvAG0AcABvAHMAaQB0AGUAQQBmAHQAZQByACIAOgB7AH0ALAAiAFMAZQBhAHIAYwBoAEEAZgB0AGUAcgAiADoAWwAiADYAMwA1ADMAOAAzAGEAOQAtADkAYwA3ADUALQA1AGQANgAxAC0AOABmADIAYgAtAGYAZQA0ADkANgBmAGQAOAAyAGIAZQA3ACIALAAiADYAMwA1ADMAOAAzAGEAOQAtADkAYwA3ADUALQA1AGQANgAxAC0AOABmADIAYgAtAGYAZQA0ADkANgBmAGQAOAAyAGIAZQA3ACIAXQB9AA==',
   'entries': [{'id': '145afb55-4e78-5dad-b208-633b5b6d19cf',
     'name': 'Donald Duck',
     'entityType': '/Duck'},
    {'id': '17bad60e-6782-5ae5-84bf-7efe05e78e58',
     'name': 'Jake McDuck',
     'entityType': '/Duck'},
    {'id': '635383a9-9c75-5d61-8f2b-fe496fd82be7',
     'name': 'Dewey Duck',
     'entityType': '/Duck'}]}}}
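
The response is a plain nested dictionary, so the entries and the cursor can be pulled out with ordinary indexing. The sketch below uses a hand-written copy of the response with the cursor shortened to a placeholder; in real code, `response` would be the value returned by the call above.

```python
# Same shape as the response shown here; the cursor is replaced with a placeholder.
response = {
    "data": {
        "search": {
            "cursor": "<cursor-from-response>",
            "entries": [
                {"id": "145afb55-4e78-5dad-b208-633b5b6d19cf", "name": "Donald Duck", "entityType": "/Duck"},
                {"id": "17bad60e-6782-5ae5-84bf-7efe05e78e58", "name": "Jake McDuck", "entityType": "/Duck"},
                {"id": "635383a9-9c75-5d61-8f2b-fe496fd82be7", "name": "Dewey Duck", "entityType": "/Duck"},
            ],
        }
    }
}

search = response["data"]["search"]
names = [entry["name"] for entry in search["entries"]]
next_cursor = search["cursor"]  # treat as opaque; pass it back unchanged

print(names)
```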

Getting the next page

We can change our code to pass the cursor as a parameter:

query = """
query searchEntities($cursor: PagingCursor, $query: String, $pageSize: Int) {
  search(
    query: $query
    cursor: $cursor
    pageSize: $pageSize
    sort: FIELDS,
    sortFields: {field: "id", direction: ASCENDING}
  ) {
    cursor
    entries {
      id
      name
      entityType
    }
  }
}
"""

variables = {
    'query': '+entityType:/Duck',
    'pageSize': 3,
    'cursor': 'ewAiAFAAYQBnAGUAIgA6ADEALAAiAFAAYQBnAGUAUwBpAHoAZQAiADoAMwAsACIAQwBvAG0AcABvAHMAaQB0AGUAQQBmAHQAZQByACIAOgB7AH0ALAAiAFMAZQBhAHIAYwBoAEEAZgB0AGUAcgAiADoAWwAiADYAMwA1ADMAOAAzAGEAOQAtADkAYwA3ADUALQA1AGQANgAxAC0AOABmADIAYgAtAGYAZQA0ADkANgBmAGQAOAAyAGIAZQA3ACIALAAiADYAMwA1ADMAOAAzAGEAOQAtADkAYwA3ADUALQA1AGQANgAxAC0AOABmADIAYgAtAGYAZQA0ADkANgBmAGQAOAAyAGIAZQA3ACIAXQB9AA=='
}

cluedin.gql.gql(ctx, query=query, variables=variables)

The result is the next three entities:

{'data': {'search': {'cursor': 'ewAiAFAAYQBnAGUAIgA6ADIALAAiAFAAYQBnAGUAUwBpAHoAZQAiADoAMwAsACIAQwBvAG0AcABvAHMAaQB0AGUAQQBmAHQAZQByACIAOgB7AH0ALAAiAFMAZQBhAHIAYwBoAEEAZgB0AGUAcgAiADoAWwAiADkAMwBiADkAMgA4ADMANQAtADkANgBmADIALQA1ADYAYQA5AC0AOQA4AGMAMAAtAGMAOAA0ADgAMgAzADYANQAyADEAYQA5ACIALAAiADkAMwBiADkAMgA4ADMANQAtADkANgBmADIALQA1ADYAYQA5AC0AOQA4AGMAMAAtAGMAOAA0ADgAMgAzADYANQAyADEAYQA5ACIAXQB9AA==',
   'entries': [{'id': '6ae43a44-81b4-5fd7-9c7b-47cb24d407ea',
     'name': 'Angus McDuck',
     'entityType': '/Duck'},
    {'id': '9353b703-13d8-59a1-886c-f40b95283c06',
     'name': 'Hortense McDuck',
     'entityType': '/Duck'},
    {'id': '93b92835-96f2-56a9-98c0-c848236521a9',
     'name': 'Matilda McDuck',
     'entityType': '/Duck'}]}}}
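
Repeating this by hand gets tedious, so the cursor hand-off is usually wrapped in a loop. The sketch below shows the idea against a stand-in for the server so that it runs without a CluedIn instance. Two things are assumptions: the loop stops when a page comes back with no entries, and the fake cursor is just a stringified offset, whereas the real cursor is an opaque value. `fake_run_query` and `fetch_all` are hypothetical names.

```python
DUCKS = ["Donald Duck", "Jake McDuck", "Dewey Duck", "Angus McDuck",
         "Hortense McDuck", "Matilda McDuck", "Fergus McDuck"]


def fake_run_query(variables):
    """Stand-in for the real call: returns one page in the same response shape."""
    start = int(variables.get("cursor") or 0)
    end = start + variables["pageSize"]
    entries = [{"name": name} for name in DUCKS[start:end]]
    return {"data": {"search": {"cursor": str(end), "entries": entries}}}


def fetch_all(run_query, variables):
    """Request pages until one comes back empty, feeding each cursor forward."""
    variables = dict(variables)  # do not mutate the caller's dict
    collected = []
    while True:
        search = run_query(variables)["data"]["search"]
        if not search["entries"]:
            break
        collected.extend(search["entries"])
        variables["cursor"] = search["cursor"]
    return collected


all_entries = fetch_all(fake_run_query, {"query": "+entityType:/Duck", "pageSize": 3})
print(len(all_entries))
```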

Using a generator

But what if you want to avoid manually passing a new cursor to every call? You can use the cluedin.gql.entries method, which returns a generator that you can convert to a list or iterate over as you wish:

...

# this is where you need a smaller page size
# if you don't want to iterate to the end
variables = {
    'query': '+entityType:/Duck',
    'pageSize': 2
}

generator = cluedin.gql.entries(ctx, query=query, variables=variables)
print(next(generator))
print(next(generator))

Result:

{'id': '145afb55-4e78-5dad-b208-633b5b6d19cf', 'name': 'Donald Duck', 'entityType': '/Duck'}
{'id': '17bad60e-6782-5ae5-84bf-7efe05e78e58', 'name': 'Jake McDuck', 'entityType': '/Duck'}
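
To see why the page size matters here, it helps to look at how a paginating generator behaves. The following is a simplified sketch of the pattern, not the SDK's actual implementation: `FakeServer` and `iter_entries` are hypothetical, the cursor is a plain offset, and an empty page is assumed to mark the end. A page is requested only when the consumer asks for an entry that has not been fetched yet.

```python
class FakeServer:
    """Stand-in for the API: serves a fixed list page by page and counts requests."""

    def __init__(self, items):
        self.items = items
        self.calls = 0

    def run_query(self, variables):
        self.calls += 1
        start = int(variables.get("cursor") or 0)
        end = start + variables["pageSize"]
        return {"data": {"search": {"cursor": str(end), "entries": self.items[start:end]}}}


def iter_entries(run_query, variables):
    """Yield entries one at a time, requesting the next page only when needed."""
    variables = dict(variables)
    while True:
        search = run_query(variables)["data"]["search"]
        if not search["entries"]:
            return
        yield from search["entries"]
        variables["cursor"] = search["cursor"]


server = FakeServer(["Donald Duck", "Jake McDuck", "Dewey Duck", "Angus McDuck", "Hortense McDuck"])
gen = iter_entries(server.run_query, {"query": "+entityType:/Duck", "pageSize": 2})
first, second = next(gen), next(gen)
print(first, second, server.calls)
```

With a page size of 2, the first two calls to next are served by a single request; the second request happens only when a third entry is asked for.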

Using automatic pagination

You can also load all entities into a DataFrame. In this case, it makes sense to use the maximum page size (10000) to reduce the number of calls to the server:

query = """
query searchEntities($cursor: PagingCursor, $query: String, $pageSize: Int) {
  search(
    query: $query
    cursor: $cursor
    pageSize: $pageSize
    sort: FIELDS,
    sortFields: {field: "id", direction: ASCENDING}
  ) {
    cursor
    entries {
      id
      name
      entityType
    }
  }
}
"""

variables = {
    'query': '+entityType:/Duck',
    'pageSize': 10_000
}

print(pd.DataFrame(cluedin.gql.entries(ctx, query=query, variables=variables)))

Result:

                                      id             name entityType
 145afb55-4e78-5dad-b208-633b5b6d19cf      Donald Duck      /Duck
 17bad60e-6782-5ae5-84bf-7efe05e78e58      Jake McDuck      /Duck
 635383a9-9c75-5d61-8f2b-fe496fd82be7       Dewey Duck      /Duck
 6ae43a44-81b4-5fd7-9c7b-47cb24d407ea     Angus McDuck      /Duck
 9353b703-13d8-59a1-886c-f40b95283c06  Hortense McDuck      /Duck
 93b92835-96f2-56a9-98c0-c848236521a9   Matilda McDuck      /Duck
 a388a77d-7d43-51d1-87b2-efb4f854b5ad    Fergus McDuck      /Duck
 b2fb05cb-e806-5088-955b-2ff3f9261236   Scrooge McDuck      /Duck
 b8fc5baf-b679-5e26-abb5-50ca77467992        Huey Duck      /Duck
 cd8fe1dd-5637-5037-931e-f8bf1a15c0b4       Della Duck      /Duck
 f5bf5d66-5698-515a-800e-9d778d916dcd       Louie Duck      /Duck

Using the search method

Starting from CluedIn Python SDK 2.5.0, you can shrink the code above to one line and get almost the same result. The difference is that it also returns all codes and properties of the entities, and you don’t have to copy and paste the same GraphQL query every time you want to query data:

# this will return all the queried entities with all properties and codes
print(pd.DataFrame(cluedin.gql.search(ctx, '+entityType:/Duck')))

Using the search method with a subset of data

Finally, if you only want a subset of data, you can use itertools.islice, but remember to set a smaller page_size so that you don’t query more data than you need:

from itertools import islice

# gets a generator that queries entities from the server three at a time
gen = cluedin.gql.search(ctx, '+entityType:/Duck', page_size=3)

# wrap in an iterator that stops after three items
# (named first_three to avoid shadowing the built-in iter)
first_three = islice(gen, 3)

# convert to DataFrame
df = pd.DataFrame(first_three)

print(df)

Or simply:

pd.DataFrame(islice(cluedin.gql.search(ctx, '+entityType:/Duck', page_size=3), 3))
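
The effect of page_size on how much data is actually fetched can be illustrated without a server. In this sketch, `paged_source` is a hypothetical stand-in for a paginated search: it fetches whole pages but yields items one by one, and records the size of every page it fetches. The total of 11 items matches the number of /Duck entities in the example.

```python
from itertools import islice


def paged_source(total, page_size, fetched):
    """Simulate a paginated search: whole pages are fetched, items are yielded singly."""
    for start in range(0, total, page_size):
        page = list(range(start, min(start + page_size, total)))
        fetched.append(len(page))  # record how many items this request returned
        yield from page


small, large = [], []
first_three_small = list(islice(paged_source(11, 3, small), 3))
first_three_large = list(islice(paged_source(11, 10, large), 3))

print(sum(small), sum(large))
```

Both calls return the same three items, but the first fetched 3 items from the simulated server and the second fetched 10, because islice stops consuming the generator only after a whole page has already been requested.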

GraphQL Actions

You can add GraphQL Actions when running GraphQL queries in the Python SDK. Actions are a way to run commands in bulk, such as processing, enriching, or deleting entities.

For example, if you take the previous GraphQL query and add an actions field to it, you can delete, enrich, or process entities in bulk. Here is an example of bulk enrichment:

query searchEntities($cursor: PagingCursor, $query: String, $pageSize: Int) {
  search(
    query: $query
    cursor: $cursor
    pageSize: $pageSize
    sort: FIELDS,
    sortFields: {field: "id", direction: ASCENDING}
  ) {
    cursor
    entries {
      id
      name
      entityType
      actions {
        enrich
      }
    }
  }
}

And here is another example of how to use Actions in the Python SDK to delete entities (please note that this is a destructive operation and should be used with caution): delete_entities.py.

Here is a list of available actions you can use:

deleteEntity - deletes an entity.
postProcess - reprocesses an entity.
enrichEntity - enriches an entity.