
Query

One of Cubyc's most powerful features is its ability to query your experiment runs directly with SQL. This guide covers the basics of querying your runs, including the schema, examples, and how to run queries from the command line.


Directory

To understand Cubyc's querying syntax, you first need to understand how Cubyc records your experiment runs. Let's start by looking at the directory structure of a typical Cubyc project.

.
├── .cubyc
│   ├── ast.json
│   ├── config.yaml
│   ├── logs.csv
│   └── metadata.json
├── .git
├── .gitignore
├── model.pkl
├── plot.png
└── experiment.py

The file used to run the experiment, experiment.py, is saved in your project directory along with any generated artifacts such as model.pkl and plot.png. The directory also includes standard versioning files like .git and .gitignore, as well as a .cubyc directory that stores your run's hyperparameters, logs, metadata, and code abstract syntax tree (AST).

The config.yaml file stores the de-serialized hyperparameters used in the experiment:

model: 
    type: "MLP"
    hidden_layers: 2
optimizer: 
    type: "Adam"
    lr: 0.001
...

The logs.csv file is a log table with columns for timestamp, metric name, and value:

timestamp,name,value
2021-01-01T00:00:00,loss,0.245
2021-01-02T00:00:00,acc,0.87
2021-01-03T00:00:00,loss,0.132
2021-01-04T00:00:00,acc,0.92
2021-01-05T00:00:00,loss,0.245
...

The metadata.json file stores run metadata, including the Python version, timestamp, dependencies, and code:

{
    "python": {"version": "3.8"},
    "timestamp": "2021-01-01T00:00:00",
    "code": "for epoch in range(10): ...",
    "requirements": ["matplotlib==3.3.3", "numpy==1.19.5", ...],
    ...
}

The ast.json file contains the abstract syntax tree of the experiment code, used for automatic version control:

{
    "_type": "Module",
    "body": [
        {
            "_type": "Assign", ...
        },
        ...
    ]
}

Tip

Add any files you want to ignore to the .gitignore file to prevent them from being tracked by Cubyc.


Schema

Cubyc employs a PostgreSQL schema to store your experiment runs, which is structured into four main tables: config, logs, metadata, and comments. These tables are all linked by a common id column, which represents the commit hash of the associated run.

The config table stores input hyperparameters for each run, with columns for hyperparameters and rows for unique runs.

id       learning_rate  batch_size  ...  dropout
f34e2a1  0.001          32               0.5
eef96c7  0.01           64               0.2
p3e2a1f  0.0001         128              0.3
s3e2a1f  0.005          256              NaN

Warning

NaN values indicate missing hyperparameters, which occur when hyperparameters are added or removed between runs.
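If you want to exclude such runs from an analysis, one option is to filter the DataFrame returned by the query function (covered in the Code section below). A minimal sketch, assuming the missing entries surface as NaN and using the dropout column from the example table above:

from cubyc import query

# fetch all hyperparameter configurations as a DataFrame
config = query(statement="SELECT * FROM config")

# keep only runs that actually set the (example) dropout hyperparameter
config = config.dropna(subset=["dropout"])
print(config)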

The logs table stores logged metrics for each run, with columns for the metric name and value, and rows for individual log entries.

id       name  value
f34e2a1  loss  0.245
f34e2a1  acc   0.87
eef96c7  loss  0.132
eef96c7  acc   0.92

Casting Values

Use the CAST function to convert value types in logs, e.g., SELECT CAST(value AS FLOAT) FROM logs
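Because each log entry is stored as its own row, a common follow-up is to pivot the queried logs into one column per metric. The pivot below is plain pandas rather than part of Cubyc's API, and it assumes the loss and acc metrics shown above:

from cubyc import query

# cast values to floats in SQL, then reshape in pandas
logs = query(statement="SELECT id, name, CAST(value AS FLOAT) AS value FROM logs")

# one row per run, one column per metric, averaging repeated log entries
wide = logs.pivot_table(index="id", columns="name", values="value", aggfunc="mean")
print(wide.head())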

The metadata table stores additional run information, such as the code and dependencies, with columns for metadata fields and rows for unique runs.

id       python__version  ...  code
f34e2a1  3.8                   import pandas as pd ...
eef96c7  3.7                   import numpy as np ...
p3e2a1f  3.8                   import matplotlib.pyplot as plt ...
s3e2a1f  3.7                   import seaborn as sns ...

Tip

Check out a run's .cubyc/metadata.json file for the full list of metadata fields.
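As a quick illustration, the python__version column shown above can be used to group runs by interpreter version. This is only a sketch; the available columns depend on what your .cubyc/metadata.json contains:

from cubyc import query

# count how many runs were executed under each Python version
statement = """
    SELECT python__version, COUNT(*) AS n_runs
    FROM metadata
    GROUP BY python__version
"""

print(query(statement=statement))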

The comments table stores user comments for each run, with columns for authors, text, and timestamps, and rows for individual comments.

id       author          comment                                         timestamp
f34e2a1  safranchik      This model is underperforming                   2023-05-01T00:00:00
f34e2a1  jensrischbieth  Let's try increasing the learning rate          2023-05-02T00:00:00
eef96c7  safranchik      Accuracy looks good, let's keep these settings  2023-05-03T00:00:00
eef96c7  jensrischbieth  Agreed, this model is ready for production      2023-05-04T00:00:00

Leaving Comments

You can add comments to your runs directly through your repository hosting platform.
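Since comments share the same id column as the other tables, you can also read them alongside a run's hyperparameters. A rough sketch, reusing the learning_rate column from the config example above:

from cubyc import query

# show each comment next to the learning rate of the run it refers to
statement = """
    SELECT comments.author, comments.comment, config.learning_rate
    FROM comments
    INNER JOIN config ON comments.id = config.id
    ORDER BY comments.timestamp
"""

print(query(statement=statement))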


Code

With Cubyc's Python library, you can query your runs directly from your code and get the results back as a Pandas DataFrame. This is particularly useful for integrating queries into Python data analysis pipelines. Simply pass your SQL statement to the query function:

from cubyc import query

statement = """
                SELECT config.learning_rate, metadata.timestamp
                FROM config
                INNER JOIN metadata ON config.id = metadata.id
            """

query(statement=statement)
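Because the result is a regular DataFrame, you can chain standard pandas operations onto it. For example, a small sketch that sorts the joined results by learning rate (column names follow the examples above):

from cubyc import query

statement = """
    SELECT config.learning_rate, metadata.timestamp
    FROM config
    INNER JOIN metadata ON config.id = metadata.id
"""

# query returns a pandas DataFrame, so the usual DataFrame methods apply
df = query(statement=statement)
print(df.sort_values("learning_rate").head())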

Remote Only

To query a remote repository, pass its URL directly to the query function:

query(statement=statement, remote="https://github.com/owner/project.git")

Command-Line

You can also run local and remote queries directly from the command line with the cubyc query command. For example, to query the config table of a project at a given path, run:

cubyc query "SELECT * FROM config" --path path/to/your/project

The same options are available for command line queries as for Python queries. For more information on running queries from the command line, see Cubyc's commands page.

Remote Only

To query a remote repository, pass its URL directly to the query command:

cubyc query "SELECT * FROM config" --path https://github.com/owner/project.git

Branches

By default, Cubyc queries all of your runs across all branches. To query the experiment history of a specific branch or commit instead, simply pass the branch parameter to the query function:

query(statement=..., branch="fancy-hippopotamus")
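For instance, you might compare a metric between two branches by running the same statement against each one. A sketch, assuming a main branch exists, that loss values are logged under the name loss, and reusing the example branch name above:

from cubyc import query

statement = "SELECT CAST(value AS FLOAT) AS loss FROM logs WHERE name = 'loss'"

# mean logged loss on each branch (branch names are illustrative)
main_loss = query(statement=statement, branch="main")["loss"].mean()
branch_loss = query(statement=statement, branch="fancy-hippopotamus")["loss"].mean()
print(f"main: {main_loss:.3f}, fancy-hippopotamus: {branch_loss:.3f}")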

Examples

To familiarize yourself with Cubyc's querying capabilities, here are some practical examples of queries you can run:

Example 1: Query the top 5 runs with the highest accuracy

SELECT 
    config.id, 
    MAX(CAST(logs.value AS FLOAT)) AS max_accuracy
FROM 
    config
INNER JOIN 
    logs ON config.id = logs.id
WHERE 
    logs.name = 'accuracy'
GROUP BY 
    config.id
ORDER BY 
    max_accuracy DESC
LIMIT 5

Example 2: Query the average loss for each learning rate

SELECT 
    config.learning_rate, 
    AVG(CAST(logs.value AS FLOAT)) AS avg_loss
FROM 
    config
INNER JOIN 
    logs ON config.id = logs.id
WHERE 
    logs.name = 'loss'
GROUP BY 
    config.learning_rate

Example 3: Query runs with comments mentioning the word `approve`

SELECT 
    comments.id, 
    comments.comment
FROM 
    comments
WHERE 
    comments.comment LIKE '%approve%'
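
Any of these statements can be passed straight to the query function or to the cubyc query command. For example, a sketch of running Example 3 from Python, which returns a DataFrame with one row per matching comment:

from cubyc import query

statement = """
    SELECT comments.id, comments.comment
    FROM comments
    WHERE comments.comment LIKE '%approve%'
"""

# runs whose discussion mentions the word "approve"
print(query(statement=statement))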