API

We are going to walk through all of the available methods in the Demyst Analytics Python package. This will give you a broad overview of the features and capabilities of the package.

Analytics

The Analytics class drives all of the methods that help you access external data. Generally you want to instantiate a separate Analytics object for each data study.

Examples

Username & Password Authentication

The best way to get started is to let the toolkit prompt you for your username and password. If you don't have credentials you can sign up here.

from demyst.analytics import Analytics

# If you don't pass in any parameters, you will be prompted for username and password.
analytics = Analytics()

Key-based Authorization

For non-interactive scripts, use the key parameter to pass in your API key.

from demyst.analytics import Analytics

# Pass in your API key with the key parameter.
analytics = Analytics(key="XXXXXXXXXXXXXXXXXXX")
More details on Analytics()

class Analytics(**kwargs)

Argument     Defaults  Notes
inputs       {}        Default input DataFrame to use
region       "us"      Which of the global edges to use: us, sg, au
username     None      If None provided, then prompted
password     None      If None provided, then prompted
sample_mode  True      Return test data, set to false for live mode
config_file  None      Config file that stores these options
key          None      For non-interactive use
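For non-interactive scripts it is usually safer to keep the API key out of the source file. The sketch below builds the keyword arguments for Analytics() from an environment variable. The variable name DEMYST_API_KEY and the helper analytics_kwargs are assumptions made for illustration and are not defined by the package.

```python
import os

# Hypothetical variable name; this document does not define one.
API_KEY_ENV_VAR = "DEMYST_API_KEY"


def analytics_kwargs(env=None, region="us", sample_mode=True):
    """Build keyword arguments for Analytics() from the environment."""
    env = os.environ if env is None else env
    key = env.get(API_KEY_ENV_VAR)
    if not key:
        raise RuntimeError(f"Set {API_KEY_ENV_VAR} before running this script.")
    # key, region and sample_mode are arguments listed in the table above.
    return {"key": key, "region": region, "sample_mode": sample_mode}


# Usage (requires the demyst package and a real key):
#   from demyst.analytics import Analytics
#   analytics = Analytics(**analytics_kwargs())
```

Passing a plain dictionary as env makes the helper easy to exercise without touching the real environment.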

.validate

Checks whether the input dataframe's column names and values would be accepted by the Demyst system.

You can run this as a quick preflight check before kicking off an enrichment job.

Examples

Validating CSVs

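This example loads a CSV file into a dataframe and validates it. The phone and post_code columns are read as strings (dtype object) so that pandas does not convert them to numbers.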

from demyst.analytics import Analytics
import pandas as pd

analytics = Analytics()

inputs = pd.read_csv('inputs.csv', dtype = {'phone': object, 'post_code': object})

analytics.validate(inputs)

phone post_code
0  15555555555     10010

Because the columns (phone, post_code) match the Demyst Types (Phone, PostCode), validation was successful.

More details on analytics.validate()

analytics.validate(inputs, providers=None, notebook=True)

Argument   Defaults  Notes
inputs     None      Required, unless provided to Analytics()
providers  []        List of Data Products to validate against
notebook   True      Produce HTML report, or Boolean if false

Results: If notebook is true, returns an HTML object suitable for Jupyter notebook display. Otherwise returns a boolean indicating whether validation succeeded.
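Before calling validate, you can run a rough local check that every column name corresponds to a type name. The sketch below does only that. It is not how validate works internally and it does not check values. The type names are copied from the Types table at the end of this page, and the helper name unrecognized_columns is made up for this example.

```python
# Type names copied from the Types table at the end of this page.
KNOWN_TYPES = {
    "blob", "business_name", "city", "country", "domain", "email_address",
    "first_name", "full_name", "gender", "ip4", "last_name", "latitude",
    "longitude", "marital_status", "middle_name", "number", "percentage",
    "phone", "post_code", "sic_code", "state", "street", "string", "url",
    "us_ein", "us_ssn", "us_ssn4", "year_month", "year",
}


def unrecognized_columns(columns):
    """Return the column names that do not match a known type name, in order."""
    return [name for name in columns if name not in KNOWN_TYPES]


print(unrecognized_columns(["phone", "post_code", "zip"]))  # ['zip']
```

With a dataframe you would pass list(inputs.columns). In a script, analytics.validate(inputs, notebook=False) remains the authoritative check, since it returns a boolean as described above.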

.search

Looks for providers that are able to return data for the provided inputs.

Use this when you have some data and want to see which of our data providers might be able to use it.

Examples

Searching providers

This example loads a CSV file into a dataframe and searches for providers that can use its columns.

from demyst.analytics import Analytics
import pandas as pd

analytics = Analytics()

inputs = pd.read_csv('inputs.csv', dtype = {'phone': object, 'post_code': object})

analytics.search(inputs)

# This will output a nicely-formatted list of providers to the notebook
More details on analytics.search()

analytics.search(inputs=None, tags=None, notebook=True, strict=False)

Argument  Defaults  Notes
inputs    None      Required, unless provided to Analytics()
tags      None      List of tags to search for
notebook  True      Produce HTML report if true, list of found providers otherwise
strict    False     If true, only return providers for which all inputs are present

Results: If notebook is true, returns an HTML object suitable for Jupyter notebook display, otherwise returns a list of result objects.
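To make the strict flag concrete, here is a small local model of the matching rule. It is only an illustration: the provider names and their inputs are invented, and the non-strict rule (at least one input present) is an assumption, since this page only defines the strict case.

```python
def matching_providers(available_inputs, provider_inputs, strict=False):
    """Return provider names whose inputs are covered by the available columns.

    provider_inputs maps a provider name to the input names it uses
    (hypothetical data, not fetched from the platform).
    """
    available = set(available_inputs)
    matches = []
    for name, needed in provider_inputs.items():
        needed = set(needed)
        if strict:
            ok = needed <= available       # every input is present
        else:
            ok = bool(needed & available)  # assumed: at least one input is present
        if ok:
            matches.append(name)
    return matches


# Invented providers, for illustration only.
catalog = {
    "provider_a": ["phone"],
    "provider_b": ["phone", "email_address"],
}
print(matching_providers(["phone", "post_code"], catalog))               # ['provider_a', 'provider_b']
print(matching_providers(["phone", "post_code"], catalog, strict=True))  # ['provider_a']
```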

.attribute_search

Looks for data providers whose response contains the given attribute.

If you are looking for a certain attribute and need to know which providers have it, use attribute_search. It lists all of the providers that contain that attribute in their response.

Examples

Searching for an Attribute

In this example, we search for the attribute NAICS (North American Industry Classification System) to find out which providers can return NAICS codes for a business.

from demyst.analytics import Analytics

analytics = Analytics()

analytics.attribute_search(name="naics")

# This prints the matching providers as a dataframe.

The resulting dataframe containing providers and attribute names looks like:

       attribute                      provider
0    naics_codes       experian_business_facts
1  primary_naics  equifax_austin_tetra_details

.enrich_and_download

Augments your input data with results from our data providers. This is the main entry point to the Demyst data platform.

enrich_and_download is actually a convenience wrapper around the more primitive functionality provided by enrich, enrich_wait, and enrich_download. We recommend that you use enrich_and_download to get started, and switch to those other methods later, e.g. when you have lots of data to process.

Examples

Enriching an input dataframe

This example uses enrich_and_download to augment an input dataframe containing some email addresses with our built-in domain_from_email data provider that simply splits the addresses into username and hostname and returns those.

from demyst.analytics import Analytics
import pandas as pd

analytics = Analytics()

inputs = pd.DataFrame.from_dict([
    { "email_address": "foo@example.com" },
    { "email_address": "test@test.com" }
])

# Here we only use a single data provider, but you can pass in
# any number of data provider names to use.
results = analytics.enrich_and_download(["domain_from_email"], inputs)
print(results)

The resulting dataframe looks like this:

  inputs.email_address  domain_from_email.row_id domain_from_email.client_id  \
0      foo@example.com                         0
1        test@test.com                         1

  domain_from_email.host domain_from_email.user domain_from_email.error
0            example.com                    foo
1               test.com                   test

Note that your input column email_address was mirrored back in a prefixed form as inputs.email_address.

The columns starting with domain_from_email were added by the data provider. While this example is somewhat contrived, it shows the basic workings of enrichment: you pass in a dataframe and the names of some providers to use, and get back a dataframe containing additional data from the providers.
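Because every result column carries its source as a prefix, you can group the columns by source with plain string handling. The sketch below works on a list of column names such as list(results.columns). The helper name columns_by_source is made up for this example.

```python
from collections import defaultdict


def columns_by_source(columns):
    """Group 'source.attribute' column names by their source prefix."""
    grouped = defaultdict(list)
    for column in columns:
        # Split on the first dot only, because attribute names may contain dots.
        source, _, attribute = column.partition(".")
        grouped[source].append(attribute)
    return dict(grouped)


columns = [
    "inputs.email_address",
    "domain_from_email.row_id",
    "domain_from_email.client_id",
    "domain_from_email.host",
    "domain_from_email.user",
    "domain_from_email.error",
]
print(columns_by_source(columns)["domain_from_email"])
# ['row_id', 'client_id', 'host', 'user', 'error']
```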

More details on analytics.enrich_and_download()

analytics.enrich_and_download(providers, inputs, validate=True)

Argument     Defaults  Notes
providers              List of provider names to query
inputs                 Inputs to pass to providers
validate     True      Perform validation before enrichment
all_updates  False     Include historical data in results

Results: Returns the enriched dataframe.

.enrich

enrich is the lower-level (compared to enrich_and_download) workhorse that lets you kick off an enrichment job asynchronously. It immediately returns a job ID, which you can use with our other methods:

  • Manually check the status of the job with enrich_status.

  • Wait for the job to finish with enrich_wait.

  • Download the results with enrich_download. You can even download partial results while the job is still running.

Use enrich for long-running jobs with real data. If you're just getting started, we recommend using enrich_and_download, which runs synchronously and does all of that for you.

Examples

Manual control over enrichment

We're re-using the example from enrich_and_download, but with enrich, which doesn't block the notebook and thus allows us to keep working while the enrichment is in progress.

from demyst.analytics import Analytics
import pandas as pd

analytics = Analytics()

inputs = pd.DataFrame.from_dict([
    { "email_address": "foo@example.com" },
    { "email_address": "test@test.com" }
])

# This kicks off the job... once it prints the job ID you can continue working.
job_id = analytics.enrich(["domain_from_email"], inputs)

# If you want to inquire about the status of the job, do the following.
# This will print some status information and return true if the job is finished.
finished = analytics.enrich_status(job_id)

# You can also wait for the job to finish:
analytics.enrich_wait(job_id)

# Now we're ready to download the data:
results = analytics.enrich_download(job_id)
More details on analytics.enrich()

analytics.enrich(providers, inputs, validate=True)

Argument     Defaults  Notes
providers              List of provider names to query
inputs                 Inputs to pass to providers
validate     True      Perform validation before enrichment
all_updates  False     Include historical data in results

Results: Returns the ID of the started enrichment job.

.enrich_status

enrich_status returns true if an enrichment job created with enrich is complete, false if it's still running. It also prints some information about job progress.

Examples

See the example for enrich.

More details on analytics.enrich_status()

analytics.enrich_status(id)

Argument  Defaults  Notes
id                  Job ID from enrich()

Results: Returns true if the job is complete, false if it's still running.

.enrich_wait

enrich_wait blocks until an enrichment job created with enrich is complete. It's similar to calling enrich_status in a loop until it returns true.

Examples

See the example for enrich.

More details on analytics.enrich_wait()

analytics.enrich_wait(id)

Argument  Defaults  Notes
id                  Job ID from enrich()

Results: None.

.enrich_download

enrich_download downloads the augmented data of an enrichment job created with enrich and returns the resulting dataframe.

By default, enrich_download will wait until the results are complete, but it also lets you download partial results while the job is still running. To do this, pass block_until_complete=False to enrich_download.

Examples

See the example for enrich.

More details on analytics.enrich_download()

analytics.enrich_download(id)

Argument              Defaults  Notes
id                              Job ID from enrich()
block_until_complete  True      Wait for all providers to finish if True, download partial results otherwise.

Results: Returns the enriched dataframe.

.enrich_credits

enrich_credits prints information about the cost of running an enrichment.

Use this to see how many credits a job would take before running it.

It has the same parameters as enrich.

Examples

Getting credit information

Here we're re-using the example from enrich, but instead of actually running the job, we just print how many credits it would take.

from demyst.analytics import Analytics
import pandas as pd

analytics = Analytics()

inputs = pd.DataFrame.from_dict([
    { "email_address": "foo@example.com" },
    { "email_address": "test@test.com" }
])

# Don't actually run the job, just print how many credits it would take.
print(analytics.enrich_credits(["domain_from_email"], inputs))
More details on analytics.enrich_credits()

analytics.enrich_credits(providers, inputs, validate=True)

Argument   Defaults  Notes
providers            List of provider names to query
inputs               Inputs to pass to providers
validate   True      Perform validation before enrichment

Results: Returns the number of credits that running the job would cost.

.products

products returns a JSON list with information about each of our data providers.

Examples

Listing all data providers

This example shows how to list all data providers.

from demyst.analytics import Analytics
import json
a = Analytics()
print(json.dumps(a.products(), indent=2))

This prints the following:

[
  {
    "id": 4,
    "name": "clear_business_search",
    "description": "Company location, registration information and other pertinent business information as found by Thomson Reuters",
    "data_source": "ThomsonReuters",
    "website": "https://www.thomsonreuters.com/en.html",
    "category": "Commercial, SME Pre-fill, Verification, and Risk",
    "region": "United States",
    "p95": 1.2,
    "p99": 1.24,
    "uptime": 98.0,
    "match_rate": 72.0,
    "fcra": false,
    "tls": true,
    "post": true,
    "footprint": false,
    "beta": false,
    "tags": [
      {
        "id": 74,
        "name": "Address Verification"
      },
      {
        "id": 79,
        "name": "Business Identity Verification"
      },
      {
        "id": 146,
        "name": "SIC Code"
      }
    ],
    "sample_data_preview": false,
    "price_final": false,
    "limited_labs": true,
    "coming_soon": true,
    "credit_cost": 0.14,
    "cost": 0.01,
    "logo": "https://console.demystdata.com/assets/thomsonreuters-64b80dcd89ac2b7afa2b61b4d43e06a32ae8284b678d4d581573c5cdfd91e072.svg",
    "online": true,
    "custom_categories": [],
    "recently_added": false,
    "contact": "",
    "processing_time": "Instant",
    "featured": false,
    "terms_accepted": false,
    "terms_touched_by": null,
    "approved": false
  },
  ...

You can use the full parameter to include even more information about each data provider, namely its inputs and outputs. We don't include that information by default because the output gets huge. You can use product_catalog to get this information for a single provider.

You can pass {"stats_available":True} to filter out any product for which product_stats are not available. See product_stats for more information.

You can use the dataframe parameter to view the products and metadata as a Pandas DataFrame.

Listing data providers in a DataFrame

This example shows how to create a DataFrame with each provider.

from demyst.analytics import Analytics
import json
a = Analytics()
a.products(dataframe=True)
                                  Product Name  Category                                     Tags  ...
0                                   json_whois  Social Media, Digital, and Online Footprint  Business, Web, Internet data  ...
1                              white_pages_pro  Consumer Pre-fill, Verification, and Fraud   Address, Identity, Score, Phone  ...
2                  white_pages_pro_find_person  Consumer Pre-fill, Verification, and Fraud   Consumer, KYC, Lat/lon, Address, Fraud, Identi...  ...
3                white_pages_pro_reverse_phone  Consumer Pre-fill, Verification, and Fraud   Consumer, KYC, Lat/lon, Address, Identity, Sco...  ...
4                                   seon_fraud  Consumer Pre-fill, Verification, and Fraud   Fraud, Consumer, Identity, Score  ...
5                                     seon_geo  Property Detail, Location and Geospatial     Location, Lat/lon, Geolocation, Address, Web  ...
6                                   seon_email  Other                                        Identity, Email  ...
...
More details on analytics.products()

analytics.products({}, full=False)

Argument                Defaults  Notes
{"product_stats":True}  {}        Only return products where product_stats are available.
full                    False     Include information about inputs and outputs for each provider.
dataframe               False     Return products listed in a DataFrame.

Results: Returns a list with information about all data providers.
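Since the default result is a plain list of dictionaries, you can filter it with ordinary Python. The sketch below selects products by tag name, using the field names from the sample output above. The helper products_with_tag is made up for this example, and the sample list is a trimmed copy of that output standing in for a real analytics.products() call.

```python
def products_with_tag(products, tag_name):
    """Return the names of products carrying the given tag (case-insensitive)."""
    wanted = tag_name.lower()
    return [
        product["name"]
        for product in products
        if any(tag["name"].lower() == wanted for tag in product.get("tags", []))
    ]


# Trimmed copy of the entry shown above, standing in for analytics.products().
sample = [
    {
        "name": "clear_business_search",
        "online": True,
        "tags": [
            {"id": 74, "name": "Address Verification"},
            {"id": 146, "name": "SIC Code"},
        ],
    },
]
print(products_with_tag(sample, "sic code"))  # ['clear_business_search']
```

With a live connection the call would be products_with_tag(analytics.products(), "SIC Code").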

.product_catalog

product_catalog returns JSON information about the inputs and outputs of a data provider.

Examples

Getting information about domain_from_email

This example shows how to get information about the domain_from_email data provider.

from demyst.analytics import Analytics
import json
a = Analytics()
print(json.dumps(a.product_catalog("domain_from_email"), indent=2))

This prints the following:

[
  {
    "input": {
      "required": [
        [
          {
            "name": "email_address",
            "type": "EmailAddress"
          }
        ]
      ],
      "optional": []
    },
    "output": [
      {
        "name": "host",
        "type": "Domain",
        "description": "Domain provider"
      },
      {
        "name": "user",
        "type": "String",
        "description": "User name"
      }
    ]
  }
]
More details on analytics.product_catalog()

analytics.product_catalog(provider)

Argument  Defaults  Notes
provider            Name of provider

Results: Returns JSON information about inputs and outputs of provider.
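The catalog structure can be walked with a few lines of Python, for example to see which input columns a provider needs. The sketch below follows the JSON layout shown above, where "required" appears to be a list of groups of fields. That reading of the nesting is an assumption, and the helper names are invented.

```python
def required_input_groups(catalog):
    """Return the required input names, one list per group in the catalog."""
    groups = []
    for entry in catalog:
        for group in entry["input"]["required"]:
            groups.append([field["name"] for field in group])
    return groups


def output_names(catalog):
    """Return the names of all output attributes in the catalog."""
    return [field["name"] for entry in catalog for field in entry["output"]]


# Copy of the domain_from_email catalog shown above.
sample_catalog = [
    {
        "input": {
            "required": [[{"name": "email_address", "type": "EmailAddress"}]],
            "optional": [],
        },
        "output": [
            {"name": "host", "type": "Domain", "description": "Domain provider"},
            {"name": "user", "type": "String", "description": "User name"},
        ],
    }
]
print(required_input_groups(sample_catalog))  # [['email_address']]
print(output_names(sample_catalog))           # ['host', 'user']
```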

.product_stats

product_stats accepts a list of data products and returns a dataframe of performance metrics and metadata for each of those products' fields. The rows in the dataframe are the output fields of the products, and the columns are consistency_rate, entity_name, error_rate, field_is_populated_rate, flattened_name, generic_flattened_name, hit_rate, last_updated_at, num_distinct_values, and product.

Getting performance statistics for three products

This example shows how to get product stats on each output field for dnb_find_company, housecanary_property_details, and infutor_property_append.

from demyst.analytics import Analytics

analytics = Analytics()
providers = ["dnb_find_company", "housecanary_property_details", "infutor_property_append"]
stats = analytics.product_stats(providers)
print(stats)

The resulting dataframe looks like the following:

  consistency_rate         entity_name  error_rate  field_is_populated_rate  \
0             None  us_business_entity         0.0                 0.558719
1             None  us_business_entity         0.0                 0.555160

                                      flattened_name generic_flattened_name  \
0  find_company_response_detail.candidate_matched...                   None
1  find_company_response_detail.candidate_returne...                   None

   hit_rate      last_updated_at  num_distinct_values           product
0  0.558719  2019-03-05 15:49:52                   55  dnb_find_company
1  0.558719  2019-03-05 15:49:52                   18  dnb_find_company
More details on analytics.product_stats()

analytics.product_stats(providers)

Argument   Defaults  Notes
providers            List of provider names to view stats

Results: Returns the performance data and metadata of products' fields.

report

report accepts an input dataframe and the response dataframe returned by the enrich methods, and provides statistics at the product and attribute level. Each row describes one response attribute: its type, its fill_rate, and its number of unique values (nunique). At the product level, the report also includes the match_rate.

Getting statistics from the enriched data

This example shows how to get stats on each attribute for enriched data from seon_email and neutrino_email_verify.

from demyst.analytics import Analytics
from demyst.analytics.report import *
import pandas as pd

analytics = Analytics()

inputs = pd.DataFrame.from_dict([
    { "email_address": "foo@example.com" },
    { "email_address": "test@test.com" }
])

providers = ["seon_email", "neutrino_email_verify"]
result = analytics.enrich_and_download(providers, inputs)
stats = report(inputs, result)
print(stats)

The resulting dataframe looks like the following:

               connector                              attribute   type \
0             seon_email  email_account_details.linkedin_exists object
1  neutrino_email_verify                            is_verified   bool

    match_rate  fill_rate   nunique  \
0          1.0        0.5         1
1          1.0        1.0         2
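To give a feel for the per-attribute numbers, here is how fill_rate and nunique could be computed for a single column of values. The definitions used (fill_rate as the fraction of non-missing values, nunique as the count of distinct non-missing values) are assumptions, since this page does not spell them out. The helper column_stats is invented and treats None and the empty string as missing.

```python
def column_stats(values):
    """Compute an assumed fill_rate and nunique for one column of values."""
    filled = [v for v in values if v is not None and v != ""]
    fill_rate = len(filled) / len(values) if values else 0.0
    return {"fill_rate": fill_rate, "nunique": len(set(filled))}


print(column_stats([True, None]))   # {'fill_rate': 0.5, 'nunique': 1}
print(column_stats([True, False]))  # {'fill_rate': 1.0, 'nunique': 2}
```

Under these assumptions, the two calls reproduce the fill_rate and nunique values shown for the two attributes in the report above.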

stats.query

stats.query filters the dataframe returned by report on any of its columns. The result has the same fields as report (connector, attribute, type, match_rate, fill_rate, nunique), restricted to the rows that match the filter.

Getting filtered stats from the enriched data

This example shows how to filter stats on each attribute for enriched data from seon_email and neutrino_email_verify.

from demyst.analytics import Analytics
from demyst.analytics.report import *
import pandas as pd

analytics = Analytics()

inputs = pd.DataFrame.from_dict([
    { "email_address": "foo@example.com" },
    { "email_address": "test@test.com" }
])

providers = ["seon_email", "neutrino_email_verify"]
result = analytics.enrich_and_download(providers, inputs)
stats = report(inputs, result)
filtered = stats.query('match_rate >= .8 & fill_rate >= .5 & nunique >= 2')
print(filtered)

The resulting dataframe looks like the following:

               connector                       attribute    type \
0             seon_email     email_domain_details.domain  object
1  neutrino_email_verify                     is_verified    bool

    match_rate  fill_rate   nunique  \
0          1.0        0.5         2
1          1.0        1.0         2

Types

At the heart of the Demyst Platform is its type system.

Types are associated with column names. For example, a column named post_code is expected to contain a postal code.

Data Type       Example                     Description
blob            RGVteXN0                    Base64-encoded binary data
business_name   Demyst Data Ltd.            The name of a company
city            New York City               The name of a city
country         US, AU, SG                  A 2- or 3-character ISO code (https://en.wikipedia.org/wiki/ISO_3166-1_alpha-2 or https://en.wikipedia.org/wiki/ISO_3166-1_alpha-3)
domain          demyst.com                  An internet domain name
email_address   support@demyst.com          An email address
first_name      John                        A first name
full_name       John Doe                    A full name
gender          m, male, f, female          A gender or abbreviation
ip4             192.168.0.1                 IP address (version 4)
last_name       Smith                       A last name
latitude        40.7                        Number between -90.0 and 90.0
longitude       -73.9                       Number between -180.0 and 180.0
marital_status  m, married, s, single, ...  A marital status or abbreviation
middle_name     Rupert                      A middle name
number          42                          A number. Supports integral and decimal numbers of arbitrary size and precision
percentage      99%, 99                     A number between 0.0 and 100.0
phone           917-475-1881                Country dependent. For US, must be 10 digits without a leading 1 or 11 digits with it; the area code must be valid
post_code       10001                       For US, a 5- or 9-digit postcode, with or without a separating dash. For other countries, must be non-empty
sic_code        2024                        A Standard Industrial Classification code. 4-digit character string
state           NY, New York                For US, a valid 2-character state code or state name. Empty otherwise
street          100 Main St                 Non-empty. A street name
string          foo                         A character string
url             https://www.demyst.com      A Uniform Resource Locator. Starts with http: or https:
us_ein          12-3456789                  An Employer Identification Number. Dashes and spaces are stripped from the input; must be a 9-digit numeric string
us_ssn          078-05-1120                 A Social Security Number. Dashes and spaces are stripped from the input; must be a 9-digit numeric string
us_ssn4         1120                        The last four digits of a Social Security Number
year_month      2019-01                     A particular month of a year, in the format yyyy-MM
year            2019                        A year
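Some of these rules are simple enough to check locally before you send any data. The sketch below implements a few of them as described in the table. It is an approximation for illustration, not the platform's validation code, and the function names are invented.

```python
import re


def normalize_us_ssn(value):
    """Strip dashes and spaces; return the 9 digits, or None if invalid."""
    digits = re.sub(r"[-\s]", "", value)
    return digits if re.fullmatch(r"[0-9]{9}", digits) else None


def is_valid_latitude(value):
    """latitude: a number between -90.0 and 90.0."""
    return -90.0 <= float(value) <= 90.0


def is_valid_year_month(value):
    """year_month: format yyyy-MM, with the month between 01 and 12."""
    return re.fullmatch(r"[0-9]{4}-(0[1-9]|1[0-2])", value) is not None


def is_valid_us_post_code(value):
    """post_code (US): 5 or 9 digits, with or without a separating dash."""
    return re.fullmatch(r"[0-9]{5}(-?[0-9]{4})?", value) is not None


print(normalize_us_ssn("078-05-1120"))      # 078051120
print(is_valid_year_month("2019-13"))       # False
print(is_valid_us_post_code("10001-1234"))  # True
```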