Enrich flat files with batches

📘

About this guide

This guide explains how to run a batch enrichment against a Data API. It assumes you have a CSV of input data available on your machine for testing.

This section lists the RESTful API calls to enrich a batch file (CSV) through a Data API. The Bearer (JWT) token for this enrichment is generated through the usual process: a username and password (which expires every 90 days) sent to https://console.demystdata.com/jwt/create. The token can also be replaced with an API key generated through the Settings page of the platform. For security, Demyst recommends using an API key together with IP whitelisting.
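
For reference, token generation looks something like the request below. This is a sketch only: the credential field names ("username" and "password") are assumptions, so confirm the exact payload with your Demyst contact.

# Illustrative sketch of generating a Bearer (JWT) token.
# The field names in the body are assumptions and may differ for your account.
curl -X POST 'https://console.demystdata.com/jwt/create' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json' \
--data-raw '{
    "username": "user@example.com",
    "password": "your-password"
}'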

Getting an Input File ready

The input file for batch enrichment should contain only the data, with no header row. Rows should be separated by newlines and cells by commas. If a cell value contains the separator (a comma), wrap that value in double quotes. Below is an example of a file with only one column.
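
For instance, a two-row file with a single email address column (matching the example used later in this guide) would simply be:

jane.doe@example.com
john.smith@example.com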

Listing Region IDs

To start an enrichment, you first need the region ID for your enrichment. This reflects the region of the Connector or the Data API you created. An incorrect region can result in a "Provider does not exist in this region" error. Global connectors default to the region of your organisation.

curl --location --request GET 'https://console.demystdata.com/list_regions' \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer xxxxx'

The response will be a list of IDs and region names. For this example, we will use the US region.

[
  {
    "id": 1,
    "name": "United States",
    "code": "us"
  }
]

Generate Pre-Signed URL for uploading Input File

To upload the input CSV file, we will generate a Pre-signed URL (active for 15 minutes) along with an S3 Object Key, using the Region ID from above. A new Pre-signed URL should be generated once it expires. This process allows the file to be securely scanned and uploaded to the Demyst Platform.

curl -X GET 'https://console.demystdata.com/presigned_batch_upload_url' \
--header 'content-type: application/json' \
--header 'Accept: application/json' \
--header 'Authorization: Bearer xxxx' \
--data-raw '{
    "region_id": 1
}'

The response will be a URL and an Object Key, as below:

{
    "presigned_url": "PRESIGNED_URL_TO_UPLOAD_FILE",
    "s3_object_key": "batches/2ec89768-7183-1234-92a8-486961cb4716"
}

Uploading Input File to the Pre-Signed URL

Now that we have the URL, we can use it to upload our CSV file. There will be no response body, only a 200 response code if the file is uploaded successfully.

curl --upload-file input_file.csv 'PRESIGNED_URL_TO_UPLOAD_FILE'

If you are doing this through Postman or another tool, change the upload cURL request to the one below and manually select the file or path to the file.

curl --location --request PUT 'PRESIGNED_URL_TO_UPLOAD_FILE' \
--form 'data=@"/Users/demyst-user/Documents/input_file.csv"'

Defining Batch Input and Generating Batch Input ID

Now that the file is uploaded, we need to define the Batch Input and get a Batch Input ID before starting the batch enrichment. To get the Batch Input ID, you will need the following:

  • List of Headers - The list should define the headers for your input columns and must conform to Demyst Types.
  • Name of the Batch Input - Demyst suggests unique names. You could use the S3 Object Key generated with the Upload URL and append your file name to it.
  • Num Rows - The total number of rows in your file for enrichment. In the example here it is 2 (a quick way to check the count is shown after this list).
  • Region ID - As described above; 1 (US) in the example here.
  • S3 Object Key - Generated with the Upload URL
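
If you are unsure of the row count, a quick check on macOS or Linux is shown below; note that wc -l counts newline characters, so make sure the file ends with a trailing newline.

# Count the data rows in the input file (there is no header row to subtract)
wc -l < input_file.csv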
          
curl -X POST https://console.demystdata.com/aegean_batch_inputs \
    -H 'content-type: application/json' \
    -H "authorization: Bearer xxxx" \
    -H 'accept: application/json' \
    -d '
      {
        "aegean_batch_input": {
          "headers": ["email_address"],
          "name": "batches/2ec89768-7183-1234-92a8-486961cb4716/input_file.csv",
          "num_rows": 2,
          "region_id": 1,
          "s3_object_key": "batches/2ec89768-7183-1234-92a8-486961cb4716"
        }
      }'

The response will provide the Batch Input ID, which is needed to start the batch enrichment. In this example, the ID is 45991.

{  
    "id": 45991,  
    "name": "input_file.csv",  
    "region_id": 1,  
    "created_at": "2022-10-18T03:09:38.000Z",  
    "updated_at": "2022-10-18T03:09:38.000Z",  
    "headers": [  
        "email_address"  
    ],  
    "num_rows": 2,  
    "organization_id": 2  
}

Starting a Batch and getting Enrich ID

This is the last step to start your batch after uploading your input file and getting a Batch Input ID. For this step, you will need the following details:

  • Aegean Batch Input ID - The Batch Input ID generated in the previous step; here it is 45991. (Aegean is an internal Demyst term.)
  • Name - The name of the batch enrichment is different from the name of the Aegean Batch Input described in the previous step. Demyst suggests a name that combines the use case or channel ID, the username, and the date, although teams can set their own nomenclature.
  • Channel ID - This will be the ID of the Data API you created. You can also run an individual connector or a group of connectors if needed, but Demyst recommends creating a Data API and running the batch against it.
  • Region ID - 1 (US) in this example.
          
curl -X POST https://console.demystdata.com/aegean_batch_runs \
    -H 'content-type: application/json' \
    -H "authorization: Bearer xxx"\
    -H 'accept: application/json' \
    -d '
      {
        "aegean_batch_run": {
          "aegean_batch_input_id": 45991,
          "name": "[email protected]",
          "channel_id": 6292,
          "region_id": 1
        }
      }'

After starting a batch, you will get a Batch Enrichment ID that can be used to track the run and download the output. Please note that this ID is different from the Batch Input ID (which tracks the input file). Here, the Batch Enrichment ID is 45259, returned along with many other details.

              
{
    "id": 45259,
    "name": "[email protected]",
    "region_id": 1,
    "aegean_batch_input_id": 45991,
    "batch_uuid": "d134753f-588a-425e-8016-1c39408c9b80",
    "created_at": "2022-10-18T03:19:11.000Z",
    "updated_at": "2022-10-18T03:19:11.000Z",
    "state": "draft",
    "num_rows": 2,
    "aegean_batch_input": {
        "id": 45991,
        "name": "input_file.csv",
        "region_id": 1,
        "created_at": "2022-10-18T03:09:38.000Z",
        "updated_at": "2022-10-18T03:09:38.000Z",
        "headers": [
            "email_address"
        ],
        "num_rows": 2,
        "latest_s3_object": {
            "key": "batches/2ec89768-7183-1234-92a8-486961cb4716",
            "bucket": "demyst-inputs-p-mt-us-east-1",
            "region": "us-east-1"
        }
    },
    "organization": {
        "name": "Demyst Data",
        "credit_balance": 705525817
    },
    "user": {
        "email": "[email protected]"
    },
    "table_providers": [
        {
            "id": 15,
            "name": "domain_from_email",
            "created_at": "2017-08-23T14:05:10.000Z",
            "updated_at": "2022-10-18T00:00:40.000Z",
            "description": "Web domain validation and descriptive information. Sample Connector from Demyst to enable testing and integrations",
            "data_region_id": 1,
            "alias": "",
            "product_type": "Full Release",
            "is_input_file": false,
            "data_category": {
                "id": 4,
                "name": "Digital",
                "created_at": "2018-02-01T17:21:05.000Z",
                "updated_at": "2019-04-30T19:50:33.000Z"
            },
            "aegean_data_source": {
                "id": 677,
                "name": "DemystData",
                "website": "https://demyst.com/",
                "icon": {
                    "url": "https://demyst-public.s3.amazonaws.com/icons/aegean_data_source/677/icon/1611613545554.jpeg"
                },
                "description": "Frictionless External Data",
                "public_searchable": true
            },
            "tags": [
                {
                    "id": 200,
                    "name": "Digital - Other",
                    "created_at": "2021-11-08T16:05:54.000Z",
                    "updated_at": "2021-11-17T19:06:36.000Z",
                    "category": "SubEntity(Data-Category)",
                    "description": "Anything else Digital related not relevant to other Categories"
                }
            ]
        }
    ]
}

If you need to replace the channel ID with an individual connector or a list of connectors, use the approach below, although Demyst does not recommend this.

Replace channel_id in the batch run payload with "provider_version_ids": [773]. The version ID for each connector is available on the catalog, and can also be found by running a GET request with the connector/provider number, as below.

The request below will return the provider_version_ids. Make sure to use the latest (largest) ID in the response; for provider 15 (the domain_from_email connector), the latest is 773. These extra steps can be avoided by creating a Data API first. A sketch of the modified batch run payload follows the version lookup request.

curl -X GET 'https://console.demystdata.com/table_providers/15/provider_versions' \
--header 'Accept: application/json' \
--header 'Content-Type: application/json' \
--header 'Authorization: Bearer xxxxxx'
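
Putting it together, a sketch of the earlier batch run request with channel_id replaced by provider version IDs could look like the following (the values simply reuse the examples above):

curl -X POST https://console.demystdata.com/aegean_batch_runs \
    -H 'content-type: application/json' \
    -H "authorization: Bearer xxx" \
    -H 'accept: application/json' \
    -d '
      {
        "aegean_batch_run": {
          "aegean_batch_input_id": 45991,
          "name": "[email protected]",
          "provider_version_ids": [773],
          "region_id": 1
        }
      }'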

Tracking Batch Status using Enrich ID

You can track the status using the Batch Enrichment ID, which is 45259 in this example, as returned by the last call above.

curl -X GET 'https://console.demystdata.com/aegean_batch_runs/45259' \
--header 'Authorization: Bearer xxxx' \
--header 'Content-Type: application/json' \
--header 'Accept: application/json'

The state in the response will be one of the following, and it usually progresses in this order. A simple way to poll for the state is sketched after the list.

  • draft - If you set the batch to draft, it will stay in draft until the draft flag is set to false. This can be used to upload a file, get the batch ready, and start it when the time is right.
  • pending_approval - For some users, approval by an admin user of the organisation is required, and the batch will stay in this state until manual approval is given.
  • uploading - The input file uploaded to the S3 location is loaded into the database (usually 20 rows at a time) so that the batch technology knows the inputs are ready to be sent to the connectors. This step usually takes only a few seconds, depending on the input file size.
  • uploaded - Once the upload is done, the batch will momentarily be in the uploaded state before moving on.
  • preprocessing - In this state, the input is parsed and converted to upstream types before the request is sent. This is also near-instantaneous.
  • preprocessed - Same as the uploaded state: confirms that preprocessing is done before moving on.
  • running - This is the main state where the enrichment runs. The time taken varies by connector; Demyst-hosted connectors will be faster.
  • complete - Once the enrichment is done, you will see the complete state and the output file link will be ready.
  • paused - You can pause your enrichment if there are any issues and restart it if needed.
  • error - Any access, enrichment, or input file error will be reflected through this state. You cannot recover from this state; make corrections and start a new batch enrichment.
  • canceled - If you cancel your enrichment, you will see this state. Like draft, paused, and pending_approval, this needs to be done manually.
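
As a convenience, below is a minimal polling sketch (assuming the jq command-line tool is available) that checks the state every 30 seconds until the batch reaches a terminal state:

# Poll the batch run state until it reaches a terminal state (complete, error, or canceled).
# Assumes jq is installed and the Bearer token is valid.
while true; do
  state=$(curl -s 'https://console.demystdata.com/aegean_batch_runs/45259' \
    --header 'Authorization: Bearer xxxx' \
    --header 'Accept: application/json' | jq -r '.state')
  echo "Current state: $state"
  case "$state" in
    complete|error|canceled) break ;;
  esac
  sleep 30
done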

Downloading the Batch Output File

Once the batch enrichment state is complete, you can use the same call as above and look for the attributes below to download the output files (separated by connector and refine). Look in batch_run_provider_versions['most_recent_export_link'] to find the pre-signed URL for downloading each file. If the URL has stopped working, make the call again to get a new pre-signed URL. The following applies to the URL (a sketch of downloading the files this way follows the list):

  • Expires in 15 minutes but can be regenerated by making the status-tracking call again
  • The file will expire depending on the file purge setting on the platform. If this is not set, the default will be 30 days.
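
Below is a minimal sketch of downloading the outputs with curl and jq, assuming each entry in batch_run_provider_versions exposes most_recent_export_link as described above (the exact JSON path is an assumption):

# Re-fetch the batch run to get fresh pre-signed links (they expire after 15 minutes),
# then download each connector's output file. Assumes jq is installed.
curl -s 'https://console.demystdata.com/aegean_batch_runs/45259' \
  --header 'Authorization: Bearer xxxx' \
  --header 'Accept: application/json' |
jq -r '.batch_run_provider_versions[].most_recent_export_link // empty' |
while read -r url; do
  i=$((i+1))
  curl -sS "$url" -o "batch_45259_output_${i}.csv"   # one file per export link
done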

If you face any issues or need elevated access, reach out to your account manager or [email protected].