Automating Enrichment Jobs

Install

First, make sure the Analytics package is installed. If you aren't sure, run:

pip install demyst-analytics

Test Data

First, let's create some test data to use in this example. In an IPython environment or in a Python script, execute this code:

import pandas as pd
test_df = pd.DataFrame({'email_address': ['test@test.com', 'test2@test.com']})
test_df.to_csv("inputs.csv", index=False, sep=',', encoding='utf-8')

You should end up with a file called inputs.csv that looks like this:

email_address
test@test.com
test2@test.com

Automation

Now that we have some test data, let's build a script to enrich our input file using the Demyst platform. For this test we will use the domain_from_email data product, a test product from Demyst that simply splits each email_address value it receives into its user and host parts.

Let's start by importing the necessary packages.

import pandas as pd
from demyst.analytics import Analytics

You will need a production API Key from the Demyst Console.

analytics = Analytics(key='XXXXXX')

If you don't have an API Key yet, you can test using your Username and Password by leaving out the key parameter.

analytics = Analytics()
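
For an unattended job, it is usually safer to keep the key out of the script itself. The sketch below is one way to do that, assuming the key is stored in an environment variable; the variable name DEMYST_API_KEY and the helper analytics_kwargs are hypothetical and not part of the Demyst package. It only builds the keyword arguments, so it falls back to the username and password prompt when no key is set.

```python
import os


def analytics_kwargs(env_var="DEMYST_API_KEY"):
    """Build keyword arguments for Analytics(...) from the environment.

    env_var is a hypothetical variable name chosen for this example.
    Returns {"key": <value>} when the variable is set and non-empty,
    otherwise an empty dict so Analytics() is called without a key.
    """
    key = os.environ.get(env_var, "").strip()
    return {"key": key} if key else {}


# Usage (requires the demyst-analytics package):
# from demyst.analytics import Analytics
# analytics = Analytics(**analytics_kwargs())
```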

Now let's read in our inputs file. Because the CSV header, email_address, is a column name the Demyst platform understands, the file can be loaded into a dataframe and used without modification.

inputs = pd.read_csv('inputs.csv')
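
In an automated job, it can be worth confirming that the expected header is present before submitting anything. The following is a minimal sketch using only the standard library; the helper missing_columns is a hypothetical name introduced here, and the list of required columns is an assumption based on this example's single email_address column.

```python
import csv
import io


def missing_columns(csv_file, required):
    """Return the required column names that are absent from the CSV header."""
    # The first row of the file is treated as the header; an empty file
    # yields an empty header, so every required name is reported missing.
    header = next(csv.reader(csv_file), [])
    return [name for name in required if name not in header]


# Demonstration with in-memory data shaped like inputs.csv.
sample = io.StringIO("email_address\ntest@test.com\ntest2@test.com\n")
print(missing_columns(sample, ["email_address"]))  # []
```

Against the real file, the same check would be called as missing_columns(open('inputs.csv', newline=''), ['email_address']), and the job can stop early if the returned list is not empty.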

To enrich the file, we pass the list of providers along with the input dataframe to the enrich function, which returns a job ID.

job_id = analytics.enrich(['domain_from_email'], inputs, validate=False)

The enrich_download function will block until the job is complete and return a dataframe:

outputs = analytics.enrich_download(job_id)

Lastly, we can take the resulting output dataframe and write it to a file.

outputs.to_csv('outputs.csv', index=False, sep=',', encoding='utf-8')

The output of this script will be a file called outputs.csv which should look like this:

inputs.email_address,domain_from_email.row_id,domain_from_email.client_id,domain_from_email.host,domain_from_email.user,domain_from_email.error
test@test.com,0,,test.com,test,
test2@test.com,1,,test.com,test2,

This output could feed the next stage of an ETL pipeline, or it could be imported into a modeling tool.
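
Before handing the file to a later stage, it may help to separate rows that came back with an error and to drop the provider prefix from the column names. The sketch below works on an in-memory copy of the sample output shown above, so it runs without a Demyst account. It assumes that an empty domain_from_email.error cell means the row succeeded, which is an inference from the sample rather than documented behavior.

```python
import io

import pandas as pd

# In-memory copy of the sample outputs.csv from this walkthrough.
SAMPLE_OUTPUT = (
    "inputs.email_address,domain_from_email.row_id,domain_from_email.client_id,"
    "domain_from_email.host,domain_from_email.user,domain_from_email.error\n"
    "test@test.com,0,,test.com,test,\n"
    "test2@test.com,1,,test.com,test2,\n"
)

outputs = pd.read_csv(io.StringIO(SAMPLE_OUTPUT))

# Assumption: a blank error cell (read by pandas as NaN) marks a successful row.
error_col = "domain_from_email.error"
succeeded = outputs[outputs[error_col].isna()]
failed = outputs[outputs[error_col].notna()]

# Strip the "inputs." / "domain_from_email." prefix so that downstream
# code can refer to plain names such as "host" and "user".
clean = succeeded.rename(
    columns=lambda name: name.split(".", 1)[1] if "." in name else name
)

print(len(succeeded), len(failed))
print(clean[["email_address", "host", "user"]])
```

With more than one provider in the enrich call, stripping prefixes could produce duplicate column names (for example, two row_id columns), so in that case it is probably better to keep the prefixes or rename columns selectively.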

The full solution is provided below. If you need help automating a production job, don't hesitate to reach out to support@demystdata.com.

import pandas as pd
from demyst.analytics import Analytics

analytics = Analytics()

inputs = pd.read_csv('inputs.csv')

job_id = analytics.enrich(['domain_from_email'], inputs, validate=False)
outputs = analytics.enrich_download(job_id)

outputs.to_csv('outputs.csv', index=False, sep=',', encoding='utf-8')