AutoML API

This document describes each endpoint of the VecML RESTful API for Automated Machine Learning (AutoML), including:

  • Endpoint URL
  • Expected Request Body (JSON)
  • Response Format (JSON)
  • Relevant Details or Constraints
  • Example Requests / Responses

All RESTful API calls to the VecML cloud database should be sent to: https://aidb.vecml.com/api. All endpoints require a JSON body unless otherwise noted.

Rate and request size limit

For resource allocation and server stability, the server enforces a rate limit on the number of API calls per second. If exceeded, the server responds with:

400 Bad Request
Request limit reached, please try again later
Please avoid high-frequency API calls. For example, when inserting multiple vectors, use a relatively large batch size in /add_data_batch and avoid calling the API too frequently.

The size of each request cannot exceed 200 MB.
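If a client does hit the rate limit, a simple exponential backoff loop is usually enough to recover. The sketch below is illustrative and not part of the API; `send` stands in for any function that performs the HTTP POST (e.g. a thin wrapper around `requests.post`) and returns the status code and body text:

```python
import time

def post_with_backoff(send, payload, max_retries=5, base_delay=1.0):
    """Retry `send(payload)` with exponential backoff while the server
    reports the rate limit. `send` must return (status_code, body_text)."""
    delay = base_delay
    status, body = send(payload)
    for _ in range(max_retries):
        if not (status == 400 and "Request limit reached" in body):
            break                 # success, or an unrelated error
        time.sleep(delay)         # wait before retrying
        delay *= 2                # back off: 1s, 2s, 4s, ...
        status, body = send(payload)
    return status, body
```

Combining backoff with large `/add_data_batch` batches keeps the request count well under the limit.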

Managing Unique IDs for vectors

Vectors are identified by unique string IDs. While VecML can maintain auto-generated string IDs internally, we strongly encourage users to maintain and specify their own unique IDs for the vectors, for convenience in later database operations. Multiple ways to specify IDs are available; please check the data insertion functions for details.
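For example, a simple deterministic scheme (purely illustrative) produces IDs that stay stable across runs and are easy to map back to source records:

```python
def make_ids(prefix, n, width=6):
    """Generate n unique, zero-padded string IDs such as "doc_000042".

    A deterministic prefix-plus-index scheme makes it easy to relate
    vectors back to source records when querying or deleting later.
    """
    return [f"{prefix}_{i:0{width}d}" for i in range(n)]
```

IDs generated this way can be passed via the string_ids field of /add_data_batch, as in the workflow example below.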


Authentication (API key):

VecML RESTful API requests are authenticated with the user's API key. An API key can be generated as follows:

  1. Go to https://account.vecml.com/user-api-keys and sign up for a free VecML account.
  2. After registration, you can get an API key by clicking "Create New API Key".
  3. This unique API key is shown only once, at creation time, so please store it securely. If you need to regenerate an API key, simply delete the previous one and create a new key.

Getting Started: Project and Dataset Management

To use AutoML APIs, the user first needs to upload/insert a vector collection (dataset) to VecML cloud.

See the vector DB API documentation for Project Management and Dataset Management. Also check Job Management for managing the async job status for data insertion/upload.


An Example Workflow

VecML Database provides fast, high-performance built-in ML tools for your projects and business. Below is a standard workflow using the VecML RESTful API for AutoML model training:

  1. Call /create_project to create a new project.

  2. Create a dataset (or collection; the terms are used interchangeably) within the project and upload the data matrix X. The recommended way is:

    • Upload a data file (supported types: csv, json, binary format, libsvm) as a dataset using /upload_automl_X. You can iteratively upload files to a data collection.

    Note: When uploading data, the user needs to specify the categorical features that will be included when training the model.

  3. Attach the label (response) to the dataset. Use /attach_automl_label to upload Y to the dataset. Now (X, Y) form the AutoML training problem.

  4. Train an AutoML model for the dataset using /train_automl_model.

    Note: Before starting model training, please call /get_upload_data_status to confirm that X and Y have been uploaded successfully. Otherwise, the model training result might be incorrect.

  5. Make predictions on a prediction/test dataset using /automl_predict.

The following example Python code demonstrates the API workflow.

import requests
import json
import numpy as np
import time

# Configuration
API_KEY = "replace_this_with_your_api_key"
BASE_URL = "https://aidb.vecml.com/api"

def make_request(endpoint, data):
    """Helper function to make API calls"""
    url = f"{BASE_URL}/{endpoint}"
    response = requests.post(url, json=data)
    print(f"Request to {endpoint}: HTTP {response.status_code}")

    if response.text:
        try:
            json_response = response.json()
            print(f"Response: {json_response}")
            return response.status_code, json_response
        except requests.exceptions.JSONDecodeError:
            print(f"Response: {response.text}")
            return response.status_code, {"error": "Not JSON", "message": response.text}
    else:
        print("Response: Empty")
        return response.status_code, None

def wait_for_job_completion(job_id, status_endpoint, max_wait_time=60):
    """Wait for an async job to complete"""
    start_time = time.time()

    while True:
        status_data = {"user_api_key": API_KEY, "job_id": job_id}
        status, status_response = make_request(status_endpoint, status_data)

        if status_response and status_response.get("status") == "finished":
            return True
        elif status_response and status_response.get("status") == "failed":
            return False

        if time.time() - start_time > max_wait_time:
            return False

        time.sleep(2)

def generate_dataset(num_samples, vector_dim, id_prefix, seed=2025):
    """Generate dataset with linear decision boundary"""
    np.random.seed(seed)
    vectors = np.random.randn(num_samples, vector_dim).tolist()
    categories = [np.random.choice(['A', 'B', 'C']) for _ in range(num_samples)]

    labels = []
    for vec, category in zip(vectors, categories):
        # Linear combination of first few components plus category weight
        score = sum(vec[:20]) + {'A': 1.0, 'B': -0.5, 'C': 0.0}[category]
        label = '1' if score > 0 else '0'
        labels.append(label)

    # Generate IDs and attributes
    ids = [f"{id_prefix}_{i:03d}" for i in range(num_samples)]
    attributes = [{"label": str(label), "category": category} for label, category in zip(labels, categories)]

    return vectors, ids, attributes

# Clean up any existing project
status, response = make_request("delete_project", {"user_api_key": API_KEY, "project_name": "AutoML-Demo"})

# 1. Create a project
project_data = {"user_api_key": API_KEY, "project_name": "AutoML-Demo", "application": "Machine Learning"}
status, response = make_request("create_project", project_data)

# 2. Initialize training dataset
init_data = {"user_api_key": API_KEY, "project_name": "AutoML-Demo", "collection_name": "training_data",
             "vector_type": "dense", "vector_dim": 64}
status, response = make_request("init", init_data)

# 3. Generate and add training data using add_data_batch
vectors, ids, attributes = generate_dataset(num_samples=1000, vector_dim=64, id_prefix="train", seed=2025)

# Add training data in batch
batch_data = {"user_api_key": API_KEY, "project_name": "AutoML-Demo", "collection_name": "training_data",
              "string_ids": ids, "data": vectors, "attributes": attributes}
status, response = make_request("add_data_batch", batch_data)
train_upload_job_id = response["job_id"]

# Wait for training data upload to complete
if not wait_for_job_completion(train_upload_job_id, "get_upload_data_status", max_wait_time=30):
    exit(1)

# 4. Train AutoML model
train_data = {"user_api_key": API_KEY, "project_name": "AutoML-Demo", "collection_name": "training_data",
              "model_name": "model1", "training_mode": "high_speed", "task_type": "classification",
              "label_attribute": "label", "train_categorical_features": ["category"]}
status, response = make_request("train_automl_model", train_data)
train_job_id = response["job_id"]

# Wait for training to complete
if not wait_for_job_completion(train_job_id, "get_automl_training_status", max_wait_time=60):
    exit(1)

# 5. Initialize prediction dataset
pred_init_data = {"user_api_key": API_KEY, "project_name": "AutoML-Demo", "collection_name": "prediction_data",
                  "vector_type": "dense", "vector_dim": 64}
status, response = make_request("init", pred_init_data)

# 6. Generate and add prediction data
prediction_vectors, prediction_ids, prediction_attributes = generate_dataset(num_samples=100, vector_dim=64, id_prefix="pred", seed=2026)

# Add prediction data in batch
pred_batch_data = {"user_api_key": API_KEY, "project_name": "AutoML-Demo", "collection_name": "prediction_data",
                   "string_ids": prediction_ids, "data": prediction_vectors, "attributes": prediction_attributes}
status, response = make_request("add_data_batch", pred_batch_data)
pred_upload_job_id = response["job_id"]

# Wait for prediction data upload to complete
if not wait_for_job_completion(pred_upload_job_id, "get_upload_data_status"):
    exit(1)

# 7. Make predictions using the existing dataset
predict_data = {"user_api_key": API_KEY, "project_name": "AutoML-Demo", "collection_name": "training_data",
                "model_name": "model1", "prediction_dataset": "prediction_data"}
status, prediction_results = make_request("automl_predict", predict_data)


Upload AutoML Datasets

/upload_automl_X

Description: Uploads a file containing dataset vectors (CSV, JSON, libsvm, binary format, etc.) and creates a new dataset with those vectors. The upload runs asynchronously; you receive a job_id to query via /get_upload_data_status.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",   // The new dataset's name
  "X": "string",   // base64-encoded data (optionally gzip compressed)
  "file_format": "string", // "csv", "json", "libsvm", "binary"
  "has_field_names": true | false,
  "vector_type": "dense" | "dense1Bit" | "dense2Bit" | "dense4Bit" | "dense8Bit" | "sparse",

  // If JSON with field names:
  "vector_data_field_name": "string",

  // If JSON or CSV with field names
  "categorical_features": ["attr1", "attr2", ...],

  // If "binary" file_format:
  "binary_dtype": "uint8" | "float32",   // for binary format data, you can set "vector_type" to be dense. The actuall dtype uses "binary_dtype"
  "binary_dim": 123,

  // Optional compression method
  "compression_type": "string",    // now only "gzip" is supported

  // Optional checksum for data integrity check
  "checksum": "string"           // this should be the SHA256 hash computed on "file_data" field
}

Required:

  • project_name: The project that the uploaded dataset belongs to.
  • collection_name: The name of the data collection in VecML DB.
  • X: The Base64 encoded raw data matrix (set of vectors) for AutoML model training.
  • file_format: "csv", "json", "libsvm" or "binary".
  • vector_type: the type of the vector. Supported types:
    • dense: the standard float32 dense vector. For example, [0, 1.2, 2.4, -10 ,5.7]. Standard embedding vectors from language or vision models can be saved as this type.
    • dense8Bit: uint8 dense vectors, with integer vector elements ranging in [0, 255]. For example, [0, 3, 76, 255, 152]. 8-bit quantized embedding vectors can be saved as this type for storage saving.
    • dense4Bit: 4-bit quantized dense vectors, with integer vector elements ranging in [0, 15].
    • dense2Bit: 2-bit quantized dense vectors, with integer vector elements ranging in [0, 3].
    • dense1Bit: 1-bit quantized dense vectors, with binary vector elements.
    • sparse: sparse vector formatted as a set of index:value pairs. Please use this for libsvm file format. This is useful for high-dimensional sparse vectors.

Conditional:

  • For csv and json format data:
    • has_field_names (required): whether the data file contains column headers (csv) or field names (json).
    • vector_data_field_name (if file_format == json and has_field_names == true): the JSON field that contains the vector data.
    • vector_attributes (optional, csv and json only): [attr1, attr2, ..., attrN], List of attribute columns/fields associated with the file's vectors.

Optional:

Auto-generation of column names (for CSV)

For csv files, if has_field_names == False, we allow the user to specify categorical feature columns by column number (1-based), such as "categorical_features": ["column 59", "column 60"]. The column name must strictly follow the format "column XX".

  • compression_type: to further reduce the file size and speed up communication, the user can gzip the file before converting to Base64 format. If gzip is applied, set this argument to "gzip". Currently, only "gzip" is supported.

  • checksum: the SHA256 hash, as a string, of the Base64-encoded X field.
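The encoding steps can be sketched as follows. This is an illustrative helper, not an official client, and it assumes the checksum is the SHA256 hex digest of the Base64 string placed in X, per the description above:

```python
import base64
import gzip
import hashlib

def encode_upload(raw_bytes, use_gzip=True):
    """Prepare the "X" and "checksum" fields for /upload_automl_X.

    Optionally gzip-compresses the raw file bytes, Base64-encodes the
    result, and computes the SHA256 hex digest of the Base64 string.
    """
    payload = gzip.compress(raw_bytes) if use_gzip else raw_bytes
    x_field = base64.b64encode(payload).decode("ascii")
    checksum = hashlib.sha256(x_field.encode("ascii")).hexdigest()
    return x_field, checksum
```

Remember to set compression_type to "gzip" in the request body whenever use_gzip=True.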

Response

{
  "success": true,
  "job_id": "string",
  "error_message": "none",
  "checksum_server": "string"      // if "checksum" is provided in the API request
}

  • job_id: Use this ID to query the status of the upload job via /get_upload_data_status.

  • checksum_server: the checksum computed for the data file X on the server side for data integrity check, if the user request contains checksum field.

Example

POST /upload_automl_X
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "ProjectA",
  "collection_name": "MyDataset",
  "X": "BASE64_ENCODED_DATA==",
  "file_format": "csv",
  "has_field_names": true,
  "vector_type": "dense",
  "categorical_features": ["date", "region", "category"]
}
Response:
{
  "success": true,
  "job_id": "api_key_123||ProjectA||MyDataset||UploadDataJob",
  "error_message": "none"
}


/attach_automl_label

Description: Attaches label data to an existing dataset by uploading a file containing the label values. The upload runs asynchronously; you receive a job_id to query via /get_upload_data_status.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",   // The existing dataset's name
  "Y": "string",   // base64-encoded label data file (optionally gzip compressed)
  "label_name": "string",   // The name of the label attribute to be added

  // Optional compression method
  "compression_type": "string"    // now only "gzip" is supported
}

Required:

  • user_api_key: Your VecML API key for authentication.
  • project_name: The project that contains the dataset.
  • collection_name: The name of the existing data collection in VecML DB.
  • Y: The Base64 encoded label data file. The file should be a plain text (or csv) file where each line contains a string label value; the labels are assigned sequentially, starting from the first vector in the collection.
  • label_name: The name to assign to this label attribute in the dataset.

Optional:

  • compression_type: To further reduce the file size and speed up communication, the user can gzip the file before converting to Base64 format. If gzip is applied, set this argument to "gzip". Currently, only "gzip" is supported.
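Building the Y field can be sketched like this (an illustrative helper, not part of the API), following the one-label-per-line layout described above:

```python
import base64

def build_label_field(labels):
    """Encode labels for the "Y" field of /attach_automl_label.

    Writes one label per line; the server assigns them sequentially,
    starting from the first vector in the collection.
    """
    text = "\n".join(str(label) for label in labels) + "\n"
    return base64.b64encode(text.encode("utf-8")).decode("ascii")
```

Keep the label order identical to the vector insertion order, since assignment is purely positional.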

Response

{
  "success": true,
  "job_id": "string",
  "error_message": "none",
  "num_vectors_parsed": 123
}

  • job_id: Use this ID to query the status of the upload job via /get_upload_data_status.
  • num_vectors_parsed: The number of ID-label pairs successfully parsed from the uploaded file.

Example

POST /attach_automl_label
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "ProjectA",
  "collection_name": "MyDataset",
  "Y": "BASE64_ENCODED_LABEL_DATA==",
  "label_name": "sentiment",
  "compression_type": "gzip"
}

Response:

{
  "success": true,
  "job_id": "api_key_123||ProjectA||MyDataset||AddAttributesFromFileJob",
  "error_message": "none",
  "num_vectors_parsed": 1000
}


AutoML API Endpoints

/train_automl_model

Description: Initiates training of an AutoML model on a specified dataset. This is an asynchronous operation that returns a job ID for tracking progress.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string", 
  "model_name": "string",
  "training_mode": "string",
  "task_type": "string",
  "label_attribute": "string",
  "train_categorical_features": ["string", ...],  // optional
  "validation_dataset": "string",                 // optional
  "data_augmentation": "string"                   // optional
}

Request Fields

  • model_name: Unique name for the model (max 128 characters, no special characters).
  • training_mode: Training mode configuration: "linear_model", "high_speed", "balanced", "high_accuracy".
  • task_type: Type of ML task - either "classification" or "regression".
  • label_attribute: Name of the attribute/column to use as the target label.
  • train_categorical_features: (Optional) Array of attribute names to treat as categorical features when training the model. The training categorical features must be a subset of the categorical features specified when uploading the dataset.
  • validation_dataset: (Optional) Name of separate dataset to use for validation. If not set, default cross-validation will be used. If validation_dataset is provided, it must have the same vector dimensions as the training dataset.
  • data_augmentation: (Optional) Whether to enable data augmentation for better performance. Data augmentation takes some extra time but usually gives better performance. Accepts three values: "off", "low", "high".

Response Success Response (200 OK):

{
  "success": true,
  "job_id": "string"
}

Notes

  • Model names must be unique within a dataset.
  • Training is asynchronous - use /get_automl_training_status with the returned job_id to monitor progress.
  • The training dataset must contain the specified label_attribute. Data samples without a label are excluded from the final training set.

Example

POST /train_automl_model
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "MLProject",
  "collection_name": "customer_data",
  "model_name": "customer_segmentation_v1",
  "training_mode": "auto",
  "task_type": "classification", 
  "label_attribute": "customer_segment",
  "train_categorical_features": ["region", "subscription_type"],
  "validation_dataset": "customer_validation_data"
}

Response:

{
  "success": true,
  "job_id": "user123_MLProject_customer_data_customer_segmentation_v1_TrainJob"
}


/automl_predict

Description: Use a trained AutoML model to make predictions on new data. Can accept data as a file upload or reference an existing dataset.

Method - POST

Request Body

Option 1: Using file upload

{
  "user_api_key": "string",
  "project_name": "string", 
  "collection_name": "string",
  "model_name": "string",
  "file_data": "string",          // base64 encoded file
  "file_format": "string",        // "csv", "json", or "libsvm"
  "has_field_names": boolean,     // required for CSV files
  "compression_type": "string"    // optional: "gzip"
}

Option 2: Using existing dataset

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string", 
  "model_name": "string",
  "prediction_dataset": "string"
}

Request Fields

  • collection_name: Name of the dataset the model was trained on
  • model_name: Name of the trained model to use for prediction
  • file_data: (Option 1) Base64 encoded file containing prediction data
  • file_format: (Option 1) Format of uploaded file: "csv", "json", or "libsvm"
  • has_field_names: (Option 1, CSV only) Whether the CSV file contains column headers
  • compression_type: (Option 1, optional) Set to "gzip" if file is gzip compressed
  • prediction_dataset: (Option 2) Name of existing dataset to make predictions on

Response For Classification Models:

{
  "success": true,
  "num_samples": 1000,
  "predictions": [0.0, 1.0, 2.0, ...],
  "prediction_metric": {
    "accuracy": 0.85,
    "micro_precision": 0.86, 
    "macro_precision": 0.84,
    "micro_recall": 0.85,
    "macro_recall": 0.83,
    "micro_f1": 0.855,
    "macro_f1": 0.835,
    "auc": 0.92
  }
}

For Regression Models:

{
  "success": true,
  "num_samples": 1000,
  "predictions": [12.5, 8.3, 15.7, ...],
  "prediction_metric": {
    "mse": 0.45,  // MSE value
    "mae": 0.62
  }
}

Error Response:

{
  "success": false,
  "error_message": "Error description"
}

Response Fields

  • num_samples: Number of samples that have a label and are used to compute the prediction metrics
  • predictions: Array of prediction values (class labels for classification, numeric values for regression)
  • prediction_metric: Performance metrics if true labels are available in the data
    • For classification: detailed metrics object
    • For regression: MSE (Mean Squared Error) and MAE (Mean Absolute Error)

Notes

  • The prediction data must have the same vector dimensions and format as the training data
  • When uploading files, supported formats are CSV, JSON, LibSVM, and raw binary (concatenated float32 or int8 vectors)
  • Prediction metrics are only calculated if at least one true label is present in the prediction data
  • Temporary datasets created from file uploads are automatically cleaned up after prediction
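As a sketch of Option 1, the request body for a CSV file upload can be assembled like this (the helper name is our own, not part of the API):

```python
import base64

def build_predict_request(api_key, project, collection, model,
                          csv_bytes, has_field_names=True):
    """Assemble an Option 1 (file upload) body for /automl_predict."""
    return {
        "user_api_key": api_key,
        "project_name": project,
        "collection_name": collection,   # dataset the model was trained on
        "model_name": model,
        "file_data": base64.b64encode(csv_bytes).decode("ascii"),
        "file_format": "csv",
        "has_field_names": has_field_names,
    }
```

The resulting dictionary can be posted directly as the JSON body, e.g. with requests.post(f"{BASE_URL}/automl_predict", json=body).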

Example

POST /automl_predict
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "MLProject",
  "collection_name": "customer_data", 
  "model_name": "customer_segmentation_v1",
  "prediction_dataset": "new_customers"
}

Response:

{
  "success": true,
  "num_samples": 393,
  "predictions": [0, 1, 2, 1, 0, 2, 1, 0, ...],
  "prediction_metric": {
    "accuracy": 0.89,
    "micro_precision": 0.91,
    "macro_precision": 0.88,
    "micro_recall": 0.89,
    "macro_recall": 0.87,
    "micro_f1": 0.90,
    "macro_f1": 0.875,
    "auc": 0.94
  }
}


/get_feature_importance

Description: Retrieves the feature importance scores for a trained AutoML model. Returns the top features ranked by their contribution to the model's predictions.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",
  "model_name": "string"
}

Required Fields:

  • user_api_key: Your VecML API key for authentication
  • project_name: The project containing the model
  • collection_name: The dataset name used to train the model
  • model_name: The name of the trained model

Response Success Response (200 OK):

{
  "success": true,
  "feature_importance": [
    {
      "feature": "string",
      "importance": 0.0
    },
    ...
  ]
}

Response Fields:

  • success: Boolean indicating if the request was successful
  • feature_importance: Array of feature objects, sorted by importance (descending order)
    • feature: Name of the feature (column name from the dataset)
    • importance: Numerical importance score (higher values indicate more important features)

Notes:

  • Returns the top 50 most important features by default
  • Feature names correspond to the column names in the original training dataset
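Since the array is already sorted by importance, extracting the top-k features client-side is straightforward (illustrative helper):

```python
def top_features(response, k=5):
    """Return the top-k (feature, importance) pairs from a
    /get_feature_importance response (sorted descending by the server)."""
    return [(item["feature"], item["importance"])
            for item in response.get("feature_importance", [])[:k]]
```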

Example

POST /get_feature_importance
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "ProjectA",
  "collection_name": "MyDataset",
  "model_name": "classification_model_1"
}

Response:

{
  "success": true,
  "feature_importance": [
    {
      "feature": "income",
      "importance": 0.2543
    },
    {
      "feature": "age",
      "importance": 0.1876
    },
    {
      "feature": "credit_score",
      "importance": 0.1432
    },
    {
      "feature": "employment_length",
      "importance": 0.0987
    },
    {
      "feature": "debt_ratio",
      "importance": 0.0654
    }
  ]
}


/delete_automl_model

Description: Deletes a specific AutoML model from a dataset. This action cannot be undone.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string", 
  "collection_name": "string",
  "model_name": "string"
}

Response Success Response (200 OK):

{
  "success": true
}

Example

POST /delete_automl_model
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "ProjectA",
  "collection_name": "MyDataset",
  "model_name": "old_classification_model"
}

Response:

{
  "success": true
}


/get_automl_training_status

Description: Retrieves the current status and progress of an AutoML training job.

Method - POST

Request Body

{
  "user_api_key": "string",
  "job_id": "string"
}

Response For In-Progress Jobs:

{
  "success": true,
  "task_type": "classification",
  "model_name": "model_123",
  "label_attribute": "category",
  "status": "in_progress",
  "start_time": "2025/01/15 14:30:25",
  "duration": "00:05:30"
}

For Completed Jobs:

{
  "success": true,
  "task_type": "classification",
  "model_name": "model_123", 
  "label_attribute": "category",
  "status": "finished",
  "start_time": "2025/01/15 14:30:25",
  "duration": "00:12:45",
  "validation_metric": {
    "accuracy": 0.85,
    "micro_precision": 0.86,
    "macro_precision": 0.84,
    "micro_recall": 0.85,
    "macro_recall": 0.83,
    "micro_f1": 0.855,
    "macro_f1": 0.835,
    "auc": 0.92
  }
}

For Failed Jobs:

{
  "success": true,
  "task_type": "regression",
  "model_name": "model_456",
  "label_attribute": "price", 
  "status": "failed",
  "start_time": "2025/01/15 15:00:10",
  "duration": "00:02:15",
  "error": "Training failed"
}

Notes

  • Status values: "pending", "in_progress", "finished", "failed"

  • For classification tasks, validation_metric contains detailed metrics

  • For regression tasks, validation_metric contains MSE (Mean Squared Error) and MAE (Mean Absolute Error)

  • Duration is formatted as "HH:MM:SS"

Example

POST /get_automl_training_status
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "job_id": "training_job_789"
}

Response:

{
  "success": true,
  "task_type": "classification",
  "model_name": "customer_segment_model",
  "label_attribute": "segment",
  "status": "finished", 
  "start_time": "2025/01/15 14:30:25",
  "duration": "00:08:42",
  "validation_metric": {
    "accuracy": 0.91,
    "micro_precision": 0.92,
    "macro_precision": 0.90,
    "micro_recall": 0.91,
    "macro_recall": 0.89,
    "micro_f1": 0.915,
    "macro_f1": 0.895,
    "auc": 0.96
  }
}


/get_model_validation_metric

Description: Retrieves validation metrics for a specific trained AutoML model.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string", 
  "model_name": "string"
}

Response: For Classification Models:

{
  "success": true,
  "validation_metric": {
    "accuracy": 0.85,
    "micro_precision": 0.86,
    "macro_precision": 0.84,
    "micro_recall": 0.85,
    "macro_recall": 0.83,
    "micro_f1": 0.855,
    "macro_f1": 0.835,
    "auc": 0.92
  }
}

For Regression Models:

{
  "success": true,
  "validation_metric": {
    "mse": 0.45,   // MSE value
    "mae": 0.61
  }
}

Example

POST /get_model_validation_metric
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "ProjectA", 
  "collection_name": "MyDataset",
  "model_name": "classification_model_1"
}

Response:

{
  "success": true,
  "validation_metric": {
    "accuracy": 0.89,
    "micro_precision": 0.90,
    "macro_precision": 0.88,
    "micro_recall": 0.89,
    "macro_recall": 0.87,
    "micro_f1": 0.895,
    "macro_f1": 0.875,
    "auc": 0.94
  }
}


/list_automl_model_infos

Description: Retrieves metadata for all AutoML models within a specific dataset.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string"
}

Response Success Response (200 OK):

{
  "success": true,
  "model_infos": [
    {
      "model_name": "string",
      "task_type": "string",
      "training_mode": "string",
      "label_attribute": "string",
      "data_augmentation": bool,
      "validation_dataset": "string",
      "create_time": "string"
    },
    ...
  ]
}

Response Fields:

  • model_name: Name of the trained model
  • task_type: Type of machine learning task ("classification" or "regression")
  • training_mode: Training mode used ("high_speed", "balanced", or "high_accuracy")
  • label_attribute: Name of the label/target attribute used for training
  • data_augmentation: Whether data augmentation is used
  • validation_dataset: Name of the validation dataset used ("cross_validation" if auto-split was used)
  • create_time: Timestamp when the model was created

Example

POST /list_automl_model_infos
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "ProjectA",
  "collection_name": "MyDataset"
}

Response:

{
  "success": true,
  "model_infos": [
    {
      "model_name": "classification_model_1",
      "task_type": "classification",
      "training_mode": "high_speed",
      "label_attribute": "category",
      "data_augmentation": "true",
      "validation_dataset": "",
      "create_time": "2025/01/15, 14:30"
    },
    {
      "model_name": "regression_model_1",
      "task_type": "regression",
      "training_mode": "balanced",
      "label_attribute": "price",
      "data_augmentation": "false",
      "validation_dataset": "validation_set",
      "create_time": "2025/01/16, 09:15"
    }
  ]
}


Clustering API Endpoints

VecML provides built-in clustering algorithms (K-Means and DBSCAN) that run directly on your vector collections. Training is asynchronous — you receive a job_id and poll for status.

Clustering Workflow

A typical clustering workflow:

  1. Create a project and upload a vector dataset (see Project Management and Dataset Management).
  2. (Optional) Use /get_dbscan_kdist_plot to explore your data and choose DBSCAN parameters.
  3. Train a clustering model using /train_kmeans or /train_dbscan.
  4. Check training status with /get_clustering_training_status.
  5. Retrieve cluster labels with /get_clustering_labels.
  6. Evaluate results with /compute_clustering_metrics.
  7. Visualize clusters with /compute_tsne_embeddings.
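The core of steps 3-5 can be sketched as follows. This is a hedged outline: post(endpoint, body) stands in for an HTTP helper like make_request in the workflow example earlier, and the request fields for /get_clustering_labels are assumed to mirror those of the other clustering endpoints:

```python
import time

def run_kmeans_workflow(post, api_key, project, collection,
                        model_name, n_clusters, poll_interval=2):
    """Train K-Means, poll until done, then fetch cluster labels.

    `post(endpoint, body)` must send the request and return parsed JSON.
    """
    job_id = post("train_kmeans", {
        "user_api_key": api_key, "project_name": project,
        "collection_name": collection, "cluster_model_name": model_name,
        "kmeans_method": "kmeans", "n_clusters": n_clusters,
    })["job_id"]

    while True:  # poll the asynchronous training job
        status = post("get_clustering_training_status",
                      {"user_api_key": api_key, "job_id": job_id})
        if status.get("status") in ("finished", "failed"):
            break
        time.sleep(poll_interval)

    if status.get("status") != "finished":
        return None
    return post("get_clustering_labels", {
        "user_api_key": api_key, "project_name": project,
        "collection_name": collection, "cluster_model_name": model_name,
    })
```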

/train_kmeans

Description: Initiates training of a K-Means clustering model on a specified dataset. This is an asynchronous operation that returns a job ID for tracking progress.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",
  "cluster_model_name": "string",
  "kmeans_method": "string",
  "n_clusters": 10,
  "n_init": 3,
  "max_train_samples": 10000,
  "max_iterations": 100,
  "tolerance": 1e-4,
  "num_threads": 4,
  "random_seed": 1234567,
  "initialization_method": "string",
  "dist_type": "string"
}

Required Fields:

  • user_api_key: Your VecML API key for authentication.
  • project_name: The project containing the dataset.
  • collection_name: The dataset to cluster.
  • cluster_model_name: Unique name for this clustering model (max 128 characters).
  • kmeans_method: K-Means variant to use. Supported values: "kmeans" (standard Lloyd's algorithm), "kmeans_hamerly" (Hamerly's accelerated algorithm, faster for low-dimensional data).
  • n_clusters: Number of clusters to form.

Optional Fields:

  • n_init (default: 3): Number of times to run K-Means with different random seeds. The best result (lowest inertia) is kept.
  • max_train_samples (default: 10000): Maximum number of samples used for training. If the dataset is larger, a random subset is used.
  • max_iterations (default: 100): Maximum number of iterations per K-Means run.
  • tolerance (default: 1e-4): Convergence threshold. Training stops when the change in centroids falls below this value.
  • num_threads (default: 4): Number of threads for parallel computation.
  • random_seed (default: 1234567): Random seed for reproducibility.
  • initialization_method (default: "k-means++"): Centroid initialization strategy. "k-means++" provides smarter initialization; "random" selects random data points.
  • dist_type (default: "Euclidean"): Distance metric. Supported: "Euclidean" / "L2", "Manhattan" / "L1", "Cosine" / "cosine", "Inner Product", "Hamming".

Response

{
  "success": true,
  "job_id": "string"
}

Notes:

  • Model names must be unique within a dataset.
  • Use /get_clustering_training_status with the returned job_id to monitor progress.
  • For high-dimensional embedding vectors (e.g., 768 or 1024 dimensions), "Cosine" distance is often the best choice.
  • n_init > 1 improves result quality at the cost of longer training time.

Example

POST /train_kmeans
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "EmbeddingProject",
  "collection_name": "document_embeddings",
  "cluster_model_name": "topic_clusters_v1",
  "kmeans_method": "kmeans",
  "n_clusters": 20,
  "n_init": 5,
  "dist_type": "Cosine",
  "initialization_method": "k-means++"
}

Response:

{
  "success": true,
  "job_id": "user123||EmbeddingProject||document_embeddings||topic_clusters_v1||KMeansTrainJob||abc-123"
}
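The request above can be assembled client-side. The sketch below uses a hypothetical helper (not part of the API) to build and sanity-check a /train_kmeans body before it is POSTed to https://aidb.vecml.com/api/train_kmeans:

```python
import json

SUPPORTED_KMEANS_METHODS = {"kmeans", "kmeans_hamerly"}

def build_train_kmeans_request(api_key, project, collection, model_name,
                               n_clusters, kmeans_method="kmeans", **options):
    """Build a JSON body for /train_kmeans, validating the required fields."""
    if kmeans_method not in SUPPORTED_KMEANS_METHODS:
        raise ValueError(f"unsupported kmeans_method: {kmeans_method}")
    if not 1 <= len(model_name) <= 128:
        raise ValueError("cluster_model_name must be 1-128 characters")
    if n_clusters < 1:
        raise ValueError("n_clusters must be positive")
    body = {
        "user_api_key": api_key,
        "project_name": project,
        "collection_name": collection,
        "cluster_model_name": model_name,
        "kmeans_method": kmeans_method,
        "n_clusters": n_clusters,
    }
    body.update(options)  # optional fields, e.g. n_init, dist_type
    return json.dumps(body)

payload = build_train_kmeans_request(
    "api_key_123", "EmbeddingProject", "document_embeddings",
    "topic_clusters_v1", n_clusters=20, n_init=5, dist_type="Cosine")
# Send with any HTTP client, e.g.:
# requests.post("https://aidb.vecml.com/api/train_kmeans", data=payload,
#               headers={"Content-Type": "application/json"})
```

The actual POST is left commented out; only the body construction is shown.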


/train_dbscan

Description: Initiates training of a DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering model. DBSCAN automatically discovers the number of clusters based on data density and can identify noise points that don't belong to any cluster.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",
  "cluster_model_name": "string",
  "eps": 0.5,
  "min_samples": 5,
  "dist_type": "string",
  "num_threads": 4,
  "random_seed": 1234567
}

Required Fields:

  • user_api_key: Your VecML API key for authentication.
  • project_name: The project containing the dataset.
  • collection_name: The dataset to cluster.
  • cluster_model_name: Unique name for this clustering model.
  • eps: The maximum distance between two points for them to be considered neighbors. This is the most important parameter — see Choosing DBSCAN Parameters below.
  • min_samples: The minimum number of points required to form a dense region (core point). Points in regions with fewer neighbors are classified as noise.
  • dist_type: Distance metric. Supported: "Euclidean" / "L2", "Manhattan" / "L1", "Cosine" / "cosine", "Inner Product", "Hamming".

Optional Fields:

  • num_threads (default: 4): Number of threads for parallel computation.
  • random_seed (default: 1234567): Random seed for reproducibility.

Response

{
  "success": true,
  "job_id": "string"
}

Notes:

  • Unlike K-Means, DBSCAN does not require specifying the number of clusters — it discovers them automatically.
  • Points that don't belong to any cluster are labeled as noise (label = -1).
  • DBSCAN works well for datasets with clusters of varying shapes and sizes.
  • The eps and min_samples parameters significantly affect results. Use the K-distance plot (/get_dbscan_kdist_plot) to guide your parameter selection.

Example

POST /train_dbscan
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "AnomalyProject",
  "collection_name": "sensor_readings",
  "cluster_model_name": "anomaly_detection_v1",
  "eps": 0.3,
  "min_samples": 10,
  "dist_type": "Euclidean"
}

Response:

{
  "success": true,
  "job_id": "user123||AnomalyProject||sensor_readings||anomaly_detection_v1||DBSCANTrainJob||def-456"
}

Choosing DBSCAN Parameters

DBSCAN's quality depends heavily on two parameters: eps (neighborhood radius) and min_samples (minimum density). Here's how to choose them:

min_samples:

  • A good starting point is min_samples = 2 * dimension of your data, but for high-dimensional embeddings this may be too large.
  • For most practical use cases with embedding vectors, values between 5 and 20 work well.
  • Higher values produce fewer, denser clusters and more noise points. Lower values produce more clusters and less noise.

eps (using the K-distance plot):

The K-distance plot is the recommended way to choose eps. Here's how it works:

  1. For each point in the dataset, compute the distance to its k-th nearest neighbor (where k = min_samples).
  2. Sort these distances in descending order and plot them.
  3. Look for the "elbow" — the point where the curve transitions from steep to flat. The distance value at this elbow is a good candidate for eps.

Interpreting the K-distance plot:

  • Steep region (left side): These are noise points or outliers — they have large distances to their k-th neighbor.
  • Flat region (right side): These are points inside dense clusters — their k-th neighbor distances are similar and small.
  • Elbow point: The transition between noise and clusters. Setting eps at this value separates meaningful clusters from noise.

Use /get_dbscan_kdist_plot to compute the K-distance values and slope, then look for the elbow in kdist_grid_desc where slope_grid_desc shows the steepest change.

Tips:

  • If too many points are noise: increase eps or decrease min_samples.
  • If everything is one big cluster: decrease eps or increase min_samples.
  • Run DBSCAN with a few different eps values around the elbow and compare results using /compute_clustering_metrics.

/get_clustering_training_status

Description: Retrieves the current status of a clustering training job (K-Means or DBSCAN).

Method - POST

Request Body

{
  "user_api_key": "string",
  "job_id": "string"
}

Response For In-Progress Jobs:

{
  "success": true,
  "job_id": "string",
  "model_name": "string",
  "model_type": "kmeans",
  "kmeans_method": "kmeans",
  "status": "in_progress",
  "error_code": "Success",
  "params": { ... },
  "start_time": "2025/06/15 14:30:25",
  "end_time": "2025/06/15 14:30:55",
  "duration": "00:00:30"
}

For Completed Jobs:

{
  "success": true,
  "job_id": "string",
  "model_name": "string",
  "model_type": "kmeans",
  "status": "finished",
  "error_code": "Success",
  "params": {
    "kmeans_method": "kmeans",
    "n_clusters": "20",
    "n_init": "3",
    "dist_type": "NegativeCosineSimilarity",
    "initialization_method": "k-means++",
    "max_iterations": "100",
    "tolerance": "0.000100",
    "num_threads": "4",
    "random_seed": "1234567",
    "max_train_samples": "10000"
  },
  "start_time": "2025/06/15 14:30:25",
  "end_time": "2025/06/15 14:32:10",
  "duration": "00:01:45"
}

For Failed Jobs:

{
  "success": true,
  "job_id": "string",
  "model_name": "string",
  "model_type": "dbscan",
  "status": "failed",
  "error_code": "Unknown",
  "error_message": "Description of the failure",
  "params": { ... },
  "start_time": "2025/06/15 15:00:10",
  "end_time": "2025/06/15 15:00:12",
  "duration": "00:00:02"
}

Response Fields:

  • status: One of "pending", "in_progress", "finished", "failed".
  • model_type: "kmeans" or "dbscan".
  • kmeans_method: (K-Means only) The variant used.
  • params: All training parameters as key-value pairs.
  • error_message: (Failed jobs only) Description of the error.
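A simple polling loop over this endpoint might look as follows. Here `fetch_status` is a caller-supplied function (hypothetical, used to keep the sketch transport-agnostic) that POSTs {"user_api_key": ..., "job_id": ...} to /get_clustering_training_status and returns the decoded JSON:

```python
import time

def wait_for_clustering_job(fetch_status, job_id, poll_interval=2.0, timeout=600.0):
    """Poll a clustering job until it finishes, fails, or times out."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status(job_id)
        if status["status"] == "finished":
            return status
        if status["status"] == "failed":
            raise RuntimeError(status.get("error_message", "clustering job failed"))
        time.sleep(poll_interval)  # keep the call rate low (see rate limit above)
    raise TimeoutError(f"job {job_id} did not finish within {timeout}s")
```

A modest poll_interval (a few seconds) avoids tripping the per-second rate limit.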

/list_clustering_model_infos

Description: Retrieves metadata for all clustering models within a specific dataset.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string"
}

Response

{
  "success": true,
  "model_infos": [
    {
      "model_name": "topic_clusters_v1",
      "model_type": "kmeans",
      "distance_type": "NegativeCosineSimilarity",
      "create_time": "2025/06/15 14:30",
      "model_path": "path/to/model",
      "params": {
        "n_clusters": 20,
        "n_init": 3,
        "kmeans_method": "kmeans",
        "initialization_method": "k-means++",
        "max_iterations": 100,
        "max_train_samples": 10000,
        "tolerance": 0.0001
      }
    },
    {
      "model_name": "anomaly_v1",
      "model_type": "dbscan",
      "distance_type": "Euclidean",
      "create_time": "2025/06/16 09:15",
      "model_path": "path/to/model",
      "params": {
        "eps": 0.3,
        "min_samples": 10
      }
    }
  ]
}

Response Fields:

  • model_type: "kmeans", "kmeans_hamerly", or "dbscan".
  • distance_type: The distance metric used for training.
  • params: Algorithm-specific parameters (K-Means params or DBSCAN params).

/delete_clustering_model

Description: Deletes a specific clustering model from a dataset. This action cannot be undone.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",
  "model_name": "string"
}

Response

{
  "success": true,
  "status": "finished",
  "error_code": "Success"
}

Example

POST /delete_clustering_model
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "ProjectA",
  "collection_name": "MyDataset",
  "model_name": "old_kmeans_model"
}


/get_clustering_labels

Description: Retrieves the cluster label assigned to each vector in the dataset by a trained clustering model.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",
  "model_name": "string"
}

Response

{
  "success": true,
  "error_code": "Success",
  "labels": [0, 1, 2, 0, 3, -1, 2, 1, ...]
}

Response Fields:

  • labels: Array of integer cluster labels, one per vector in the dataset. The order matches the internal vector ordering.
    • For K-Means: labels range from 0 to n_clusters - 1.
    • For DBSCAN: labels range from 0 to num_clusters - 1, with -1 indicating noise points.

Example

POST /get_clustering_labels
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "ProjectA",
  "collection_name": "MyDataset",
  "model_name": "topic_clusters_v1"
}


/get_kmeans_centroids

Description: Retrieves the centroid vectors of a trained K-Means model.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",
  "model_name": "string"
}

Response

{
  "success": true,
  "error_code": "Success",
  "centroids": [
    [0.12, -0.45, 0.78, ...],
    [0.56, 0.23, -0.91, ...],
    ...
  ]
}

Response Fields:

  • centroids: 2D array of shape [n_clusters, vector_dim]. Each row is the centroid vector for one cluster.

Notes:

  • This endpoint is only available for K-Means models. Calling it on a DBSCAN model will return an error.
  • Centroids can be used for nearest-centroid classification or as representative vectors for each cluster.
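As a sketch of the nearest-centroid classification mentioned above, assuming Euclidean distance (for a model trained with "Cosine", use cosine distance instead):

```python
import math

def nearest_centroid(vector, centroids):
    """Return the index of the closest centroid under Euclidean (L2) distance."""
    def l2(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centroids)), key=lambda i: l2(vector, centroids[i]))

# Toy 2-D centroids; real ones come from /get_kmeans_centroids.
centroids = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
assert nearest_centroid([0.9, 1.2], centroids) == 1
```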

/get_dbscan_core_points

Description: Retrieves the indices of core points identified by a trained DBSCAN model. Core points are data points that have at least min_samples neighbors within eps distance.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",
  "model_name": "string"
}

Response

{
  "success": true,
  "error_code": "Success",
  "core_points": [0, 3, 5, 7, 12, 15, ...]
}

Response Fields:

  • core_points: Array of integer indices identifying which vectors in the dataset are core points.

Notes:

  • This endpoint is only available for DBSCAN models.
  • Core points are the "anchors" of each cluster. Non-core points that are within eps of a core point are border points; the rest are noise.
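Given the labels from /get_clustering_labels and the core point indices from this endpoint, each vector's role can be derived exactly as described above (illustrative sketch):

```python
def point_roles(labels, core_points):
    """Classify each vector as 'core', 'border', or 'noise' for a DBSCAN model."""
    core = set(core_points)
    roles = []
    for i, label in enumerate(labels):
        if i in core:
            roles.append("core")
        elif label == -1:
            roles.append("noise")      # not assigned to any cluster
        else:
            roles.append("border")     # in a cluster, but not a core point
    return roles

assert point_roles([0, 0, -1, 1], [0, 3]) == ["core", "border", "noise", "core"]
```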

/get_dbscan_kdist_plot

Description: Computes the K-distance plot data for a dataset, which is used to determine the optimal eps parameter for DBSCAN. The K-distance plot shows, for each point, the distance to its k-th nearest neighbor, sorted in descending order.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",
  "k": 5,
  "max_num_samples": 2000,
  "max_num_grids": 200,
  "dist_type": "string",
  "num_threads": 3,
  "random_seed": 1234567
}

Required Fields:

  • user_api_key: Your VecML API key.
  • project_name: The project containing the dataset.
  • collection_name: The dataset to analyze.

Optional Fields:

  • k (default: 5): The k-th nearest neighbor to compute distances for. Should match the min_samples value you plan to use for DBSCAN.
  • max_num_samples (default: 2000): Maximum number of samples to use for computing the plot. A random subset is selected if the dataset is larger.
  • max_num_grids (default: 200): Number of evenly spaced grid points in the output arrays. Controls the resolution of the plot.
  • dist_type (default: "Euclidean"): Distance metric. Should match the dist_type you plan to use for DBSCAN. Supported: "Euclidean" / "L2", "Manhattan" / "L1", "Cosine" / "cosine", "Inner Product", "Hamming".
  • num_threads (default: 3): Number of threads for parallel computation.
  • random_seed (default: 1234567): Random seed for reproducibility.

Response

{
  "success": true,
  "error_code": "Success",
  "k": 5,
  "dist_type": "Euclidean",
  "max_num_samples": 2000,
  "max_num_grids": 200,
  "kdist_grid_desc": [2.45, 2.41, 2.38, ..., 0.12, 0.08],
  "slope_grid_desc": [-0.02, -0.03, -0.05, ..., -0.8, -1.2]
}

Response Fields:

  • kdist_grid_desc: Array of K-distance values sorted in descending order, sampled at max_num_grids evenly spaced grid points. These are the y-axis values of the K-distance plot.
  • slope_grid_desc: Array of slope values at each grid point (first derivative of the K-distance curve). Large negative slopes indicate the "elbow" region.

How to Use the Response:

  1. Plot the curve: Use kdist_grid_desc as y-values (x-axis is just the index from 0 to max_num_grids - 1).
  2. Find the elbow: Look for where slope_grid_desc has the most negative value (steepest descent). The corresponding kdist_grid_desc value at that index is a good candidate for eps.
  3. Alternatively: Visually inspect the plot for the "knee" — the transition point from steep to flat.
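Step 2 can be automated. A minimal sketch that reads the eps candidate off the steepest point of the returned curve:

```python
def eps_from_kdist(kdist_grid_desc, slope_grid_desc):
    """Return the K-distance value at the most negative slope (the elbow)."""
    elbow = min(range(len(slope_grid_desc)), key=lambda i: slope_grid_desc[i])
    return kdist_grid_desc[elbow]

# Toy arrays with the shape of a /get_dbscan_kdist_plot response.
kdist = [2.45, 2.41, 2.38, 1.10, 0.95, 0.90, 0.88]
slope = [-0.02, -0.03, -1.28, -0.15, -0.05, -0.02, -0.01]
assert eps_from_kdist(kdist, slope) == 2.38
```

Treat the result as a starting point; it is still worth trying a few eps values around it (see Tips under Choosing DBSCAN Parameters).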

Example

POST /get_dbscan_kdist_plot
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "ProjectA",
  "collection_name": "MyDataset",
  "k": 10,
  "dist_type": "Cosine",
  "max_num_samples": 5000,
  "max_num_grids": 300
}

Response:

{
  "success": true,
  "error_code": "Success",
  "k": 10,
  "dist_type": "NegativeCosineSimilarity",
  "max_num_samples": 5000,
  "max_num_grids": 300,
  "kdist_grid_desc": [0.95, 0.93, 0.91, 0.88, ...],
  "slope_grid_desc": [-0.01, -0.01, -0.02, -0.03, ...]
}


/compute_clustering_metrics

Description: Computes evaluation metrics for a trained clustering model. Supports both unsupervised metrics (no ground truth needed) and supervised metrics (when ground truth labels are available).

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",
  "model_name": "string",
  "ground_truth_attribute": "string",
  "ground_truth_labels": ["string", ...]
}

Required Fields:

  • user_api_key: Your VecML API key.
  • project_name: The project containing the dataset.
  • collection_name: The dataset.
  • model_name: The trained clustering model to evaluate.

Optional Fields:

  • ground_truth_attribute: Name of a vector attribute containing true cluster labels. Required for supervised metrics.
  • ground_truth_labels: Array of ground truth label strings, one per vector. Alternative to ground_truth_attribute. Required for supervised metrics.

Response

{
  "success": true,
  "status": "finished",
  "metrics": {
    "distance_type": "Euclidean",
    "ground_truth_attribute": "",
    "ground_truth_labels_provided": false,
    "noise_ratio": 0.05,
    "inertia_sse": 1234.56,
    "davies_bouldin": 1.23,
    "calinski_harabasz": 456.78,
    "silhouette": 0.45,
    "adjusted_rand_index": 0.67,
    "normalized_mutual_info": 0.72,
    "best_match_accuracy": 0.81,
    "homogeneity": 0.75,
    "completeness": 0.70,
    "v_measure": 0.72
  }
}

Unsupervised Metrics (always computed):

  • noise_ratio: Fraction of points labeled as noise (-1). Only meaningful for DBSCAN. Lower is usually better, but some noise is expected.
  • inertia_sse / L1_dispersion / cosine_dispersion: Within-cluster sum of distances (the name depends on dist_type). Only computed for Euclidean, Manhattan, or Cosine distance. Lower is better; not comparable across different n_clusters.
  • davies_bouldin: Ratio of within-cluster to between-cluster distances. Lower is better; 0 is perfect.
  • calinski_harabasz: Ratio of between-cluster to within-cluster variance. Higher is better.
  • silhouette: How similar each point is to its own cluster vs. the nearest neighboring cluster. Range [-1, 1]; higher is better, and > 0.5 is good.

Supervised Metrics (only when ground truth is provided):

  • adjusted_rand_index: Similarity between predicted and true labels, adjusted for chance. Range [-1, 1]; 1 = perfect.
  • normalized_mutual_info: Mutual information between predicted and true labels, normalized. Range [0, 1]; 1 = perfect.
  • best_match_accuracy: Best one-to-one matching accuracy between predicted and true labels. Range [0, 1]; 1 = perfect.
  • homogeneity: Whether each cluster contains only members of a single true class. Range [0, 1]; 1 = perfect.
  • completeness: Whether all members of a true class are assigned to the same cluster. Range [0, 1]; 1 = perfect.
  • v_measure: Harmonic mean of homogeneity and completeness. Range [0, 1]; 1 = perfect.
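For reference, homogeneity, completeness, and v_measure follow the standard entropy-based definitions. The sketch below reproduces them locally (the server computes all of these for you via /compute_clustering_metrics):

```python
import math
from collections import Counter

def v_measure(true_labels, pred_labels):
    """Return (homogeneity, completeness, v_measure) for two label sequences."""
    n = len(true_labels)

    def entropy(labels):
        return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

    def cond_entropy(a, b):
        # H(a | b): uncertainty left in labels `a` once `b` is known.
        h = 0.0
        b_counts = Counter(b)
        for (ai, bi), c in Counter(zip(a, b)).items():
            h -= (c / n) * math.log(c / b_counts[bi])
        return h

    h_true, h_pred = entropy(true_labels), entropy(pred_labels)
    homog = 1.0 if h_true == 0 else 1.0 - cond_entropy(true_labels, pred_labels) / h_true
    compl = 1.0 if h_pred == 0 else 1.0 - cond_entropy(pred_labels, true_labels) / h_pred
    v = 0.0 if homog + compl == 0 else 2 * homog * compl / (homog + compl)
    return homog, compl, v

# A perfect clustering (up to label renaming) scores 1.0 on all three.
assert v_measure(["a", "a", "b", "b"], [1, 1, 0, 0]) == (1.0, 1.0, 1.0)
```

Splitting one true class across several pure clusters keeps homogeneity at 1.0 while lowering completeness, which is why v_measure balances the two.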

Example

POST /compute_clustering_metrics
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "ProjectA",
  "collection_name": "MyDataset",
  "model_name": "topic_clusters_v1",
  "ground_truth_attribute": "true_category"
}


/compute_tsne_embeddings

Description: Computes 2D t-SNE embeddings for visualizing clustering results. Each point is projected to 2D space and labeled with its cluster assignment.

Method - POST

Request Body

{
  "user_api_key": "string",
  "project_name": "string",
  "collection_name": "string",
  "model_name": "string",
  "max_samples": 2000,
  "perplexity": 30.0,
  "theta": 0.5,
  "max_iter": 600,
  "learning_rate": 200.0
}

Required Fields:

  • user_api_key: Your VecML API key.
  • project_name: The project containing the dataset.
  • collection_name: The dataset.
  • model_name: The trained clustering model (used for coloring points by cluster label).

Optional Fields:

  • max_samples (default: 2000): Maximum number of points to include. A random subset is selected if the dataset is larger.
  • perplexity (default: 30.0): Balances attention between local and global structure. Typical range: 5–50. Lower values emphasize local structure.
  • theta (default: 0.5): Speed/accuracy trade-off for the Barnes-Hut approximation. 0 = exact (slow), 1 = fast (approximate).
  • max_iter (default: 600): Maximum number of optimization iterations.
  • learning_rate (default: 200.0): Step size for gradient descent. Typical range: 10–1000.

Response

{
  "success": true,
  "status": "finished",
  "num_points": 2000,
  "parameters": {
    "max_samples": 2000,
    "perplexity": 30.0,
    "theta": 0.5,
    "max_iter": 600
  },
  "embeddings": [
    {"label": 0, "x": 12.34, "y": -5.67},
    {"label": 2, "x": -8.91, "y": 3.45},
    {"label": -1, "x": 45.12, "y": 22.33},
    ...
  ]
}

Response Fields:

  • num_points: Number of points in the embedding.
  • embeddings: Array of 2D points, each with:
    • label: Cluster assignment from the clustering model (-1 = noise for DBSCAN).
    • x, y: 2D coordinates from t-SNE.

Notes:

  • t-SNE is non-deterministic — different runs may produce different layouts, but cluster structure should be consistent.
  • For large datasets, use max_samples to limit computation time.
  • Points with label -1 (DBSCAN noise) are typically scattered away from cluster centers in the visualization.
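For plotting, the returned points can be grouped into one (x, y) series per cluster label (illustrative helper; any plotting library can consume the result):

```python
from collections import defaultdict

def group_by_label(embeddings):
    """Group t-SNE points into {label: (xs, ys)} for per-cluster scatter plots."""
    series = defaultdict(lambda: ([], []))
    for p in embeddings:
        xs, ys = series[p["label"]]
        xs.append(p["x"])
        ys.append(p["y"])
    return dict(series)

points = [{"label": 0, "x": 1.0, "y": 2.0},
          {"label": -1, "x": 9.0, "y": 9.0},   # DBSCAN noise point
          {"label": 0, "x": 1.5, "y": 2.5}]
series = group_by_label(points)
assert series[0] == ([1.0, 1.5], [2.0, 2.5])
```

Each entry of `series` can then be drawn as one scatter call, with label -1 styled separately as noise.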

Example

POST /compute_tsne_embeddings
Content-Type: application/json

{
  "user_api_key": "api_key_123",
  "project_name": "ProjectA",
  "collection_name": "MyDataset",
  "model_name": "topic_clusters_v1",
  "max_samples": 5000,
  "perplexity": 40.0
}