Configuring yente

The yente service is built to require minimal configuration, but several environment variables let you select the search provider and define a custom data manifest.


The API server has a few operations-related settings, which are passed as environment variables. The settings include:

| Env. variable | Default | Description |
| --- | --- | --- |
| YENTE_MANIFEST | manifests/default.yml | Specifies the path of the manifest that defines the datasets exposed by the service. This is used to add extra datasets to the service or to define custom scopes for entity screening. |
| YENTE_CRONTAB | 0 * * * * | Gives the frequency at which new data will be indexed, as a crontab expression. |
| YENTE_AUTO_REINDEX | true | Can be set to false to disable automatic data updates and force data to be re-indexed only via the command line (yente reindex). |
| YENTE_UPDATE_TOKEN | unsafe-default | Should be set to a secret string. The token is used with a POST request to the /updatez endpoint to force an immediate re-indexing of the data. |
| YENTE_INDEX_TYPE | elasticsearch | Should be one of elasticsearch or opensearch, depending on which provider you use. |
| YENTE_INDEX_URL | http://index:9200 | The URL of your search index provider backend. |
| YENTE_INDEX_NAME | yente | The name prefix that will be used for the search index. |
| YENTE_ELASTICSEARCH_CLOUD_ID | - | Set this if you are using Elastic Cloud and want to connect using the cloud ID rather than an endpoint URL. |
| YENTE_OPENSEARCH_REGION | - | Specifies your region if you are using AWS-hosted OpenSearch. |
| YENTE_OPENSEARCH_SERVICE | - | Should be aoss if you are using Amazon OpenSearch Serverless Service and es if you are using the default Amazon OpenSearch Service. |
| YENTE_INDEX_USERNAME | - | Username for the search provider. Required if connecting using Elastic Cloud. |
| YENTE_INDEX_PASSWORD | - | Password for the search provider. Required if connecting using Elastic Cloud. |
| YENTE_HTTP_PROXY | - | Sets a proxy for yente's outgoing HTTP requests. |
| YENTE_MAX_BATCH | 100 | How many entities to accept in a /match batch at most. |
| YENTE_MATCH_PAGE | 5 | How many results to return per /match query by default. |
| YENTE_MAX_MATCHES | 500 | How many results to return per /match query at most. |
| YENTE_MATCH_CANDIDATES | 10 | How many candidates to retrieve, as a multiplier of the /match limit. |
| YENTE_MATCH_FUZZY | true | Whether to run expensive Levenshtein queries inside Elasticsearch. |
| YENTE_QUERY_CONCURRENCY | 10 | How many match and search queries to run against the search index in parallel. |
| YENTE_DELTA_UPDATES | true | When set to false, yente will download the entire dataset when refreshing the index. |
| YENTE_STREAM_LOAD | true | If set to false, yente will download the full data before indexing it. This improves the stability of the indexer but requires some local disk cache space. |
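
As a rough illustration of how these settings combine, the sketch below exports a handful of them for a hypothetical OpenSearch deployment. The host name and the way the token is generated are assumptions made for the example, not recommended values. The last lines work through the YENTE_MATCH_CANDIDATES multiplier, assuming the candidate count is simply the /match limit times the multiplier.

```shell
#!/bin/sh
# Hypothetical values throughout; adjust to your own deployment.
export YENTE_INDEX_TYPE="opensearch"
export YENTE_INDEX_URL="https://search.example.internal:9200"  # assumed host
export YENTE_INDEX_NAME="yente"

# Replace the unsafe default token with a random 64-character hex string.
YENTE_UPDATE_TOKEN="$(head -c 32 /dev/urandom | od -An -tx1 | tr -d ' \n')"
export YENTE_UPDATE_TOKEN

# Defaults from the settings table, repeated here for the arithmetic.
export YENTE_MATCH_PAGE=5
export YENTE_MATCH_CANDIDATES=10

# With a /match limit of 5 and a multiplier of 10, the index is asked
# for 5 * 10 candidates before scoring (assumed interpretation).
CANDIDATES=$((YENTE_MATCH_PAGE * YENTE_MATCH_CANDIDATES))
echo "candidates retrieved per query: $CANDIDATES"
```

In a container setup these variables would normally be set in the service's environment definition rather than exported in an interactive shell.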

Managing data updates

By default, yente checks every hour for an updated build of the OpenSanctions database published at data.opensanctions.org. If a fresh version is found, an indexing process is spawned to load the data into the Elasticsearch index.
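
The default schedule 0 * * * * is a five-field crontab expression (minute, hour, day of month, month, day of week), so it fires at minute zero of every hour. A minimal sketch of choosing a different interval follows; the six-hour schedule is an arbitrary example, and it assumes yente accepts the usual step syntax.

```shell
#!/bin/sh
# Example only: check for new data every six hours instead of hourly.
export YENTE_CRONTAB="0 */6 * * *"

# Sanity check: a crontab expression should have exactly five fields.
set -f   # disable globbing so "*" is not expanded to file names
FIELD_COUNT=$(set -- $YENTE_CRONTAB; echo $#)
set +f
echo "YENTE_CRONTAB has $FIELD_COUNT fields"
```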

You can change this behavior in two ways:

  • Specify a crontab for YENTE_CRONTAB in your environment in order to run the auto-update process at a different interval. Setting the environment variable YENTE_AUTO_REINDEX to false will disable automatic data updates entirely.
  • If you wish to run an indexing process manually, you can do so with the command yente reindex. This command must be invoked inside the application container. For example, in a docker-compose-based environment, the full command would be: docker-compose run yente reindex.
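
The two options above can be combined as in the following sketch. It assumes a docker-compose setup with a service named yente, as in the example command. The run helper and the DRY_RUN variable are hypothetical, added only so the script prints the commands by default instead of executing them. The curl call uses the /updatez route described in the settings table; passing the token as a token query parameter and the localhost:8000 address are assumptions to verify against your deployment.

```shell
#!/bin/sh
# Hypothetical helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Disable the built-in scheduler. This must be set in the environment
# of the yente service itself, e.g. in docker-compose.yml.
export YENTE_AUTO_REINDEX="false"

# Re-index on demand inside the application container.
run docker-compose run yente reindex

# Alternative: force an immediate re-index over HTTP
# (assumed URL and assumed "token" query parameter).
run curl -X POST "http://localhost:8000/updatez?token=${YENTE_UPDATE_TOKEN:-unsafe-default}"
```

Running the script with DRY_RUN=0 executes the commands for real, which requires docker-compose and curl to be installed.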

The production settings for api.opensanctions.org use these two options in conjunction to move reindexing to a separate Kubernetes CronJob that allows for stricter resource management.