
Introduction

CluedIn's official integrations are one-size-fits-all: they will generally try to ingest as much data as they can.

If you want to ingest data in a more precise fashion, or ingest data from an in-house tool, a legacy tool, or a custom API, you will need to create your own integration.

Prerequisites

CluedIn is a .NET platform, so you will need:

  • .NET installed
  • Visual Studio installed
  • Docker

Creating initial template

To avoid cumbersome boilerplate, CluedIn provides a script that generates a working Visual Studio solution.

  1. Create a folder for your provider
     mkdir my-first-integration
     cd my-first-integration
    
  2. Run the generator
     docker run --rm -ti -v ${PWD}:/generated cluedin/generator-crawler-template
    

    The generator will ask some questions and then generate all your solution files:

          _-----_     ╭──────────────────────────╮
         |       |    │  Welcome to the awesome  │
         |--(o)--|    │    CluedIn integration   │
        `---------´   │        generator!        │
         ( _´U`_ )    ╰──────────────────────────╯
         /___A___\   /
          |  ~  |
        __'.___.'__
      ´   `  |° ´ Y `
    
     ? Name of this crawler? MyFirstIntegration
     ? Will it support webhooks? No
     ? Does it require OAuth? No
    
  3. Initialize a git repo
     git init
     git add .
     git commit -m "Initial commit"
    
  4. Open the solution in Visual Studio and build it, or alternatively build it from the command line using the dotnet CLI: dotnet build

Adding a Model

There are several steps needed to create a crawler that fetches data, creates Clues, and passes them back to CluedIn for processing. Please refer to our Hello World sample repository for a working example; it is based on a simple external JSON service.

The following are the minimal steps required to replicate the Hello World example:

  1. Create model classes. You can use a subgenerator for this:
     docker run --rm -ti -v ${PWD}:/generated cluedin/generator-crawler-template crawler-template:model
    
  2. Answer the questions as follows to create a User model and vocabulary similar to the ones in the example User.cs:
          _-----_     ╭──────────────────────────╮
         |       |    │    This sub-generator    │
         |--(o)--|    │   allows to create new   │
        `---------´   │       vocabularies       │
         ( _´U`_ )    ╰──────────────────────────╯
         /___A___\   /
          |  ~  |
        __'.___.'__
      ´   `  |° ´ Y `
     ? What is the model name? User
     ? What is the entity type? Person
     ? Enter a comma separated list of properties to add to the model id,name,username,email
     ? Choose the visibility for key: id(undefined) Visible
     ? Choose the type for key id Integer
     ? Should key id map to a common vocab? None
     ? Choose the visibility for key: name(undefined) Visible
     ? Choose the type for key name Text
     ? Should key name map to a common vocab? None
     ? Choose the visibility for key: username(undefined) Visible
     ? Choose the type for key username Text
     ? Should key username map to a common vocab? None
     ? Choose the visibility for key: email(undefined) Hidden
     ? Choose the type for key email Email
     ? Should key email map to a common vocab? ContactEmail
        create src/MyFirstIntegration.Core/Models/User.cs
        create src/MyFirstIntegration.Crawling/ClueProducers/UserClueProducer.cs
        create src/MyFirstIntegration.Crawling/Vocabularies/UserVocabulary.cs
        create test/MyFirstIntegration.Crawling.Unit.Test/ClueProducers/UserClueProducerTests.cs
    

    This will generate four files, as shown above. If you run the tests now you will notice one of them fails, as we still need to complete some work in the ClueProducer. (A hedged sketch of the generated model class is shown after these steps.)

  3. Go to the src/MyFirstIntegration.Crawling/ClueProducers/UserClueProducer.cs file and, on line 29, uncomment the following code:
     if(input.Name != null)
         data.Name = input.Name;
    
  4. Delete all other comments in the UserClueProducer.cs file.

  5. Open src/MyFirstIntegration.Infrastructure/MyFirstIntegrationClient.cs and modify line 16 to set the base URL of the endpoint:
         private const string BaseUri = "https://jsonplaceholder.typicode.com";
    
  6. Since this is a public endpoint, we don’t need to pass any tokens. Remove or comment out line 42:

     // client.AddDefaultParameter("api_key", myfirstintegrationCrawlJobData.ApiKey, ParameterType.QueryString);
    
  7. Add a method to retrieve users (you will need to import some namespaces too):

     public async Task<IList<User>> GetUsers() => await GetAsync<IList<User>>("users");
    
  8. In src/MyFirstIntegration.Crawling/MyFirstIntegrationCrawler.cs you retrieve the data you want to insert into CluedIn. Add the following inside the GetData method:
         // retrieve data from the provider and yield plain objects

         foreach (var user in client.GetUsers().Result)
         {
             yield return user;
         }
    
  9. To test the provider, you can use the integration test provided. Open the test/integration/Crawling.MyFirstIntegration.Integration.Test/MyFirstIntegrationDataIngestion.cs file and, in the CorrectNumberOfEntityTypes method, add a new InlineData attribute to indicate that 10 Person entities are expected (that’s what the sample endpoint returns by default):
     [Theory]
     [InlineData("/Provider/Root", 1)]
     [InlineData("/Person", 10)]
     public void CorrectNumberOfEntityTypes(string entityType, int expectedCount)
    
  10. Execute the tests - they should all pass.

  11. Before adding the integration to CluedIn, open the file src/MyFirstIntegration.Core/MyFirstIntegrationConstants.cs and modify the values of the constants before the TODO comment. This information is used in the CluedIn GUI to show information about the integration. In particular, you should set CrawlerDescription, Integration, Uri (if this integration corresponds to an online tool), and IconResourceName; this last property corresponds to the path of an embedded resource in the Provider project.
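For reference, the constants mentioned in the last step look roughly like the following. This is a hedged sketch: the exact set of constants and the surrounding type come from the generated template and may differ slightly, and the values shown are placeholders.

     // Hedged sketch of src/MyFirstIntegration.Core/MyFirstIntegrationConstants.cs.
     // The constant names follow the generated template; the values are placeholders.
     public static class MyFirstIntegrationConstants
     {
         public const string CrawlerDescription = "Crawls user data from a sample JSON API";
         public const string Integration = "MyFirstIntegration";
         public const string Uri = "https://jsonplaceholder.typicode.com";              // only relevant if the source is an online tool
         public const string IconResourceName = "Resources.myfirstintegration.png";     // embedded resource in the Provider project
     }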
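Likewise, the model class generated in step 2 looks roughly like this. This is a hedged sketch: the properties follow the answers given to the sub-generator, but the namespace and any serialization attributes depend on the template version.

     // Hedged sketch of the generated src/MyFirstIntegration.Core/Models/User.cs.
     using Newtonsoft.Json;

     namespace MyFirstIntegration.Core.Models
     {
         public class User
         {
             [JsonProperty("id")]
             public int Id { get; set; }

             [JsonProperty("name")]
             public string Name { get; set; }

             [JsonProperty("username")]
             public string Username { get; set; }

             [JsonProperty("email")]
             public string Email { get; set; }
         }
     }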

Architecture

As you can see in the example, these are the main components:

  • A client that knows how to retrieve data from your source (e.g. MyFirstIntegrationClient.cs). It has methods that produce plain objects containing the information.
  • The GetData method in the main crawling class, MyFirstIntegrationCrawler.cs, which you can consider the entry point for the provider. This method invokes the appropriate client methods in order to yield plain objects.
  • A Vocabulary class (e.g. UserVocabulary.cs), which is for the most part generated automatically. This class defines the different keys of the data you are processing and how they map to generic terms (email, address, company) also used by other sources. In addition, it can define relationships with other Vocabularies (also known as edges), for example the relationship between a user and a company.
  • A ClueProducer (e.g. UserClueProducer.cs), which essentially translates the plain object (retrieved by the client) into a clue, the object understood by CluedIn. It uses the keys from the Vocabulary to map the data from the object onto the clue.
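To make this concrete, here is a rough sketch of how a ClueProducer ties the model, the Vocabulary and the clue together. Treat the names as assumptions: the base class, factory interface and EntityData members shown here follow the conventions of the generated crawler template and may vary between CluedIn SDK versions.

     // Rough sketch only: BaseClueProducer, IClueFactory and clue.Data.EntityData follow the
     // generated template's conventions and may differ between CluedIn SDK versions.
     public class UserClueProducer : BaseClueProducer<User>
     {
         private readonly IClueFactory _factory;

         public UserClueProducer(IClueFactory factory)
         {
             _factory = factory;
         }

         protected override Clue MakeClueImpl(User input, Guid accountId)
         {
             // Create a clue of entity type Person, keyed on the source system's id.
             var clue = _factory.Create(EntityType.Person, input.Id.ToString(), accountId);
             var data = clue.Data.EntityData;

             // Map fields from the plain object onto the clue, using the generated vocabulary keys.
             if (input.Name != null)
                 data.Name = input.Name;

             var vocab = new UserVocabulary();
             data.Properties[vocab.Email] = input.Email;   // exact property-bag API may differ per SDK version

             return clue;
         }
     }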

In this case the sample API was very open and generic; in other cases, however, you may need extra information (credentials, data sources, etc.) about how to connect to the source or what data to retrieve. This can be captured in the CrawlJobData (e.g. MyFirstIntegrationCrawlJobData.cs). You can enrich it with whatever properties you need. However, you will also need to expand two methods in the Provider (e.g. MyFirstIntegrationProvider.cs), as sketched after the list below:

  • GetCrawlJobData, which translates the keys from a generic dictionary into the CrawlJobData object, and
  • GetHelperConfiguration, which performs the opposite translation (from the CrawlJobData back to a dictionary).
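As an illustration, the mapping performed by those two methods might look roughly like the following. This is a hedged, simplified sketch: the generated methods in MyFirstIntegrationProvider.cs have longer signatures (context, organization and provider ids, and so on), the "apiKey" key name is purely illustrative, and only the ApiKey property is known from the template to exist on the generated CrawlJobData.

     // Hedged, simplified sketch of the dictionary <-> CrawlJobData mapping that
     // GetCrawlJobData and GetHelperConfiguration perform. "apiKey" is an illustrative key name,
     // and a parameterless constructor is assumed; the generated class may instead take the
     // configuration dictionary directly.
     private static MyFirstIntegrationCrawlJobData ToCrawlJobData(IDictionary<string, object> configuration)
     {
         var jobData = new MyFirstIntegrationCrawlJobData();

         if (configuration.TryGetValue("apiKey", out var apiKey))
             jobData.ApiKey = apiKey?.ToString();

         return jobData;
     }

     private static IDictionary<string, object> ToHelperConfiguration(MyFirstIntegrationCrawlJobData jobData)
     {
         return new Dictionary<string, object>
         {
             ["apiKey"] = jobData.ApiKey
         };
     }

In practice you would fold this logic into the two generated methods rather than adding separate helpers; the point is simply that the same keys must round-trip between the dictionary and the CrawlJobData object.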

Deploying the provider locally

If you are running CluedIn locally for testing purposes using Docker, you can follow these instructions to add the integration.

Testing the provider in your environment

Please refer to Install an integration.

Generating Models, Vocabularies and ClueProducers

Please refer to the FileGenerator GitHub Repository. This can be used to generate basic models, vocabularies, and clue producers using one of three options: a metadata file, CSV files with data, or Microsoft SQL Server. The generators need to be adapted to each data source; more details can be found in the README of the repository.