
Dataiku (Data Exploration GUI)

Dataiku is a data science and machine learning platform designed to help organizations manage, analyze, and model data. It provides a collaborative environment for data scientists, analysts, and business users to work together across the data pipeline, from data preparation to model deployment.

Installation

Dependencies

Dataiku is provided as a Virtual Machine, not as a container image. There is no official support for containers, nor will there be in the foreseeable future.

This is why we require KubeVirt to deploy our solution.

Configuration

Dataiku requires a significant amount of configuration, which is done in several steps.

Password

The VM has a user called dataiku with a default password set to dataiku.

However, when the VM is launched at boot, the password is set to a random value read from the secret key dataiku-secrets.vmPassword.
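
For reference, the generated password can be read back from that secret with kubectl. A minimal sketch, assuming the secret lives in the namespace Dataiku is deployed in (here named dataiku; adjust as needed):

# Read the VM password from the Kubernetes secret (namespace name is an assumption)
kubectl -n dataiku get secret dataiku-secrets -o jsonpath='{.data.vmPassword}' | base64 -d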

Network

KubeVirt assigns a random MAC address at every VMI creation, meaning that recreating a VMI breaks networking because the MAC address expected by the guest has changed.

To solve this, we delete /etc/netplan/50-cloud-init.yaml, which contains the expected MAC address, then run dhclient to obtain a new dynamic IP address.
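
As an illustration, the boot-time fix amounts to roughly the following commands run inside the VM; the interface name enp1s0 is an assumption and the actual script shipped in the image may differ:

# Remove the cloud-init netplan file that pins the stale MAC address
sudo rm -f /etc/netplan/50-cloud-init.yaml
# Request a fresh DHCP lease on the primary interface (check the name with `ip link`)
sudo dhclient -v enp1s0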

Container execution

To avoid putting a lot of strain on the VM itself, Dataiku can offload processing jobs to containers that are executed in the same Kubernetes cluster.

This requires some configuration, detailed in the following sections.

Port setup

To be able to communicate with these containers, Dataiku chooses ports from a wide range by default (1024-65536), which is unsuitable for security and scalability reasons. We therefore use sysctl to override the local port range and the already reserved ports.

The default configuration starts from the main Dataiku port (11000), keeps 10 ports for Dataiku itself and adds 30 for container executions. You can override it with these settings:

containerExecution:
  portRange:
    start: 11000
    end: 11041
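
On the VM side this translates into sysctl settings along the following lines. This is a sketch matching the default range above; the sub-range reserved for Dataiku itself is an assumption based on the description:

# Restrict the kernel's local port range to the window allocated to Dataiku
sudo sysctl -w net.ipv4.ip_local_port_range="11000 11041"
# Reserve the first ports for Dataiku itself so they are never handed out to container-execution sockets
sudo sysctl -w net.ipv4.ip_local_reserved_ports="11000-11010"
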
Other configuration

You can configure the rest using the default configuration below as a reference:

containerExecution:
  name: "k8s-container"
  description: "Kubernetes containerized execution environment"
  setAsDefault: true # use this container execution by default instead of ones manually created by users
  overwriteConfig: true # overwrite the configuration at every VM reboot
  config:
    rancherUrl: https://rancher.dh.artemis-ia.fr # Kubernetes cluster URL
    repositoryURL: "gitea.dh.artemis-ia.fr/athea" # repository for pulling the image
    dockerImage: "dataiku-container:13.3.1"
    # kubernetesNamespace: "dataiku-containers" # override the namespace; by default Dataiku uses the namespace it is deployed in
    kubernetesResourceRequestsCpu: 1
    kubernetesResourceRequestsMemory: 4096
    kubernetesResourceLimitsCpu: 2
    kubernetesResourceLimitsMemory: 8192
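
Once a containerized job has run, the pods spawned by Dataiku can be checked directly in the target namespace. A quick verification sketch (the namespace name dataiku is an assumption; use the namespace Dataiku is deployed in, or the one set in kubernetesNamespace):

# List the execution pods and jobs spawned by Dataiku
kubectl -n dataiku get pods,jobs
# Inspect one of them to confirm the requests/limits above were applied
kubectl -n dataiku describe pod <pod-name>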

Docker

Dataiku requires Docker to be able to use containerized execution for things like visual recipes, notebook kernels, and code environments.

This means it must be able to push to and pull from a registry.

docker:
  registry: "gitea.dh.artemis-ia.fr"
  username: "user"
  password: "password"
  insecure: false # whether your registry is HTTP or HTTPS, used to configure Docker's daemon.json
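
When insecure is true, the registry has to be declared in Docker's daemon.json inside the VM. A minimal sketch of what that looks like, using the registry from the example above (note that this replaces any existing daemon.json):

# Declare the registry as insecure (plain HTTP) for the Docker daemon
cat <<'EOF' | sudo tee /etc/docker/daemon.json
{
  "insecure-registries": ["gitea.dh.artemis-ia.fr"]
}
EOF
sudo systemctl restart docker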

SSO & Groups

We also need to prepare custom groups with fine-grained control, as well as OIDC SSO to synchronize Keycloak users and groups.

First, we need to create local Dataiku groups with fine-grained permissions. You can use the following example, which creates two groups, one of them an admin group:

groups:
  - name: "sso_data_team"
    description: "(SSO) Default group for read/write access to projects"
    dataikuProfile: DESIGNER
    ldapGroupNames: []
    azureADGroupNames: []
    ssoGroupNames:
      - "sso_data_team"
    customGroupNames: []
    sourceType: "LOCAL_NO_AUTH"
    admin: false
    mayManageUDM: false
    mayCreateProjects: false
    mayCreateProjectsFromMacros: false
    mayCreateProjectsFromTemplates: false
    mayCreateProjectsFromDataikuApps: false
    mayWriteUnsafeCode: false
    mayWriteSafeCode: true
    mayCreateAuthenticatedConnections: false
    mayCreateCodeEnvs: false
    mayCreateClusters: false
    mayCreateCodeStudioTemplates: false
    mayDevelopPlugins: false
    mayEditLibFolders: false
    mayManageCodeEnvs: false
    mayManageClusters: false
    mayManageCodeStudioTemplates: false
    mayViewIndexedHiveConnections: false
    mayCreatePublishedAPIServices: false
    mayCreatePublishedProjects: false
    mayWriteInRootProjectFolder: true
    mayCreateActiveWebContent: true
    mayCreateWorkspaces: false
    mayShareToWorkspaces: true
    mayCreateDataCollections: true
    mayPublishToDataCollections: true
    mayManageFeatureStore: false
    canObtainAPITicketFromCookiesForGroupsRegex: ""
  - name: "sso_admin_team"
    description: "(SSO) Admin group for read/write access to projects"
    dataikuProfile: "PLATFORM_ADMIN"
    ldapGroupNames: []
    azureADGroupNames: []
    ssoGroupNames:
      - "sso_admin_team"
    customGroupNames: []
    sourceType: "LOCAL_NO_AUTH"
    admin: true
    mayManageUDM: true
    mayCreateProjects: true
    mayCreateProjectsFromMacros: true
    mayCreateProjectsFromTemplates: true
    mayCreateProjectsFromDataikuApps: true
    mayWriteUnsafeCode: true
    mayWriteSafeCode: true
    mayCreateAuthenticatedConnections: true
    mayCreateCodeEnvs: true
    mayCreateClusters: true
    mayCreateCodeStudioTemplates: true
    mayDevelopPlugins: true
    mayEditLibFolders: true
    mayManageCodeEnvs: true
    mayManageClusters: true
    mayManageCodeStudioTemplates: true
    mayViewIndexedHiveConnections: true
    mayCreatePublishedAPIServices: true
    mayCreatePublishedProjects: true
    mayWriteInRootProjectFolder: true
    mayCreateActiveWebContent: true
    mayCreateWorkspaces: true
    mayShareToWorkspaces: true
    mayCreateDataCollections: true
    mayPublishToDataCollections: true
    mayManageFeatureStore: true
    canObtainAPITicketFromCookiesForGroupsRegex: ""

Then we can finally set up SSO using a standard configuration:

sso:
  enabled: true
  clientId: "dataiku"
  clientSecret: "zWLdrQzxzsPFpe53LNo9zLLugcXIXM18"
  scope: "openid email"
  issuer: "https://auth.dh.artemis-ia.fr/realms/kosmos"
  authorizationEndpoint: "https://auth.dh.artemis-ia.fr/realms/kosmos/protocol/openid-connect/auth"
  tokenEndpoint: "http://keycloak-cluster-service.kosmos-iam/realms/kosmos/protocol/openid-connect/token"
  jwksUri: "http://keycloak-cluster-service.kosmos-iam/realms/kosmos/protocol/openid-connect/certs"
  claimKeyIdentifier: "preferred_username"
  claimKeyDisplayName: "name"
  claimKeyEmail: "email"
  enableGroups: true # ask Dataiku to use the group claim from the Keycloak-issued token
  claimKeyGroups: "groups"
  useGlobalProxy: true
  usePKCE: true
  tokenEndpointAuthMethod: "CLIENT_SECRET_BASIC"
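
To verify that the issuer and JWKS endpoints are reachable, the realm's standard OIDC discovery document can be queried. This uses the URLs from the configuration above; the second command targets the internal cluster service, so it should be run from inside the cluster or the VM:

# Public discovery document listing the authorization, token and JWKS endpoints
curl -s https://auth.dh.artemis-ia.fr/realms/kosmos/.well-known/openid-configuration
# JWKS endpoint resolved through the internal cluster service
curl -s http://keycloak-cluster-service.kosmos-iam/realms/kosmos/protocol/openid-connect/certs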

What's left

There are a few key points left to prepare for Dataiku:

  • Pods, jobs and secrets created by Dataiku are not cleaned up. We should put in place a job that cleans old ones and deletes secrets that are not attached to running pods (see the sketch after this list).

  • The kernels.json file is not created automatically, so the user has to go into Settings > Containerized Execution > (Re)Install Jupyter Kernels.

  • The notebook kernels have a mismatch in the ports chosen for communication: the Dataiku VM chooses adequate ports, but the pod chooses different ones.

    Errors for debugging

    Traceback (most recent call last):
      File "/opt/dataiku-dss-13.3.1/python310.packages/tornado/web.py", line 1786, in _execute
        result = await result
      File "/usr/lib/python3.10/asyncio/tasks.py", line 304, in __wakeup
        future.result()
      File "/opt/dataiku-dss-13.3.1/python310.packages/tornado/gen.py", line 780, in run
        yielded = self.gen.throw(exc)
      File "/opt/dataiku-dss-13.3.1/dku-jupyter/packages/notebook/services/sessions/handlers.py", line 79, in post
        model = yield maybe_future(
      File "/opt/dataiku-dss-13.3.1/python310.packages/tornado/gen.py", line 767, in run
        value = future.result()
      File "/opt/dataiku-dss-13.3.1/python310.packages/tornado/gen.py", line 780, in run
        yielded = self.gen.throw(exc)
      File "/opt/dataiku-dss-13.3.1/dku-jupyter/packages/notebook/dataiku/sessionmanager.py", line 92, in create_session
        kernel_id = yield self.start_kernel_for_session(session_id, path,
      File "/opt/dataiku-dss-13.3.1/python310.packages/tornado/gen.py", line 767, in run
        value = future.result()
      File "/opt/dataiku-dss-13.3.1/python310.packages/tornado/gen.py", line 780, in run
        yielded = self.gen.throw(exc)
      File "/opt/dataiku-dss-13.3.1/dku-jupyter/packages/notebook/services/sessions/sessionmanager.py", line 119, in start_kernel_for_session
        kernel_id = yield maybe_future(
      File "/opt/dataiku-dss-13.3.1/python310.packages/tornado/gen.py", line 767, in run
        value = future.result()
      File "/usr/lib/python3.10/asyncio/futures.py", line 201, in result
        raise self._exception.with_traceback(self._exception_tb)
      File "/usr/lib/python3.10/asyncio/tasks.py", line 232, in __step
        result = coro.send(None)
      File "/opt/dataiku-dss-13.3.1/dku-jupyter/packages/notebook/services/kernels/kernelmanager.py", line 178, in start_kernel
        kernel_id = await maybe_future(self.pinned_superclass.start_kernel(self, **kwargs))
      File "/opt/dataiku-dss-13.3.1/python310.packages/jupyter_client/multikernelmanager.py", line 185, in start_kernel
        km, kernel_name, kernel_id = self.pre_start_kernel(kernel_name, kwargs)
      File "/opt/dataiku-dss-13.3.1/python310.packages/jupyter_client/multikernelmanager.py", line 170, in pre_start_kernel
        km = self.kernel_manager_factory(connection_file=os.path.join(
      File "/opt/dataiku-dss-13.3.1/python310.packages/jupyter_client/multikernelmanager.py", line 87, in create_kernel_manager
        km.shell_port = self._find_available_port(km.ip)
      File "/opt/dataiku-dss-13.3.1/python310.packages/jupyter_client/multikernelmanager.py", line 101, in _find_available_port
        tmp_sock.bind((ip, 0))
    OSError: [Errno 98] Address already in use
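
The cleanup mentioned in the first point could start from something like the following sketch, meant to run as a periodic job. The namespace and the dataiku-exec- secret name prefix are assumptions and must be adapted to what Dataiku actually creates:

# Namespace where Dataiku spawns its execution resources (assumption)
NAMESPACE="dataiku"
# Remove execution pods that have finished
kubectl -n "$NAMESPACE" delete pod --field-selector=status.phase=Succeeded
kubectl -n "$NAMESPACE" delete pod --field-selector=status.phase=Failed
# For Jobs, setting ttlSecondsAfterFinished on the Job spec is the cleaner option if it can be applied
# Rough heuristic for secrets: delete those not referenced by any existing pod
for s in $(kubectl -n "$NAMESPACE" get secrets -o name | grep 'dataiku-exec-'); do
  kubectl -n "$NAMESPACE" get pods -o yaml | grep -q "${s#secret/}" || kubectl -n "$NAMESPACE" delete "$s"
done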

Usage