Query
One of Cubyc's most powerful features is its ability to query your experiment runs directly with SQL. This guide covers the basics of querying your runs, including the schema, examples, and how to run queries from the command line.
Directory
To understand Cubyc's querying syntax, you first need to understand how Cubyc records your experiment runs. Let's start by looking at the directory structure of a typical Cubyc project.
.
├── .cubyc
│ ├── ast.json
│ ├── config.yaml
│ ├── logs.csv
│ └── metadata.json
├── .git
├── .gitignore
├── model.pkl
├── plot.png
└── experiment.py
The file used to run the experiment, experiment.py
, will be saved in your project directory
along with any generated artifacts such as model.pkl
and plot.png
.
This directory will also includes standard versioning files like .git
and .gitignore
,
and a .cubyc
directory that stores your run's hyperparameters, logs, metadata, and code abstract syntax tree (AST).
De-serialized hyperparameters used in the experiment.
Log table with columns for timestamp, metric name, and value.
Run metadata, including the Python version, timestamp, dependencies, and code.
Tip
Add any files you want to ignore to the .gitignore
file to prevent them from being tracked by Cubyc.
Schema
Cubyc employs a PostgreSQL schema to store your experiment runs,
which is structured into four main tables: config
, logs
, metadata
, and comments
.
These tables are all linked by a common id column, which represents the commit hash of the associated run.
The config
table stores input hyperparameters for each run, with columns for hyperparameters,
and rows for unique runs.
id | learning_rate | batch_size | ... | dropout |
---|---|---|---|---|
f34e2a1 |
0.001 |
32 |
0.5 |
|
eef96c7 |
0.01 |
64 |
0.2 |
|
p3e2a1f |
0.0001 |
128 |
0.3 |
|
s3e2a1f |
0.005 |
256 |
NaN |
Warning
NaN
values indicate missing hyperparameters, which occur when these are added or removed between runs.
The logs
table stores logged metrics for each run, with columns for metric names and value,
and rows for individual log entries.
id | name | value |
---|---|---|
f34e2a1 |
loss |
0.245 |
f34e2a1 |
acc |
0.87 |
eef96c7 |
loss |
0.132 |
eef96c7 |
acc |
0.92 |
Casting Values
Use the CAST function to convert value types in logs,
e.g., SELECT CAST(value AS FLOAT) FROM logs
The metadata
table stores additional run information, such as the code and dependencies,
with columns for metadata fields and rows for unique runs.
id | python__version | ... | code |
---|---|---|---|
f34e2a1 |
3.8 |
import pandas as pd ... |
|
eef96c7 |
3.7 |
import numpy as np ... |
|
p3e2a1f |
3.8 |
import matplotlib.pyplot as plt ... |
|
s3e2a1f |
3.7 |
import seaborn as sns ... |
Tip
Check out any .cubyc/metadata.json
for the full list of metadata fields.
The comments
table stores user comments for each run,
with columns for authors, text, and timestamps, and rows for individual comments.
id | author | comment | timestamp |
---|---|---|---|
f34e2a1 |
safranchik | This model is underperforming | 2023-05-01T00:00:00 |
f34e2a1 |
jensrischbieth | Let's try increasing the learning rate | 2023-05-02T00:00:00 |
eef96c7 |
safranchik | Accuracy looks good, let's keep these settings | 2023-05-03T00:00:00 |
eef96c7 |
jensrischbieth | Agreed, this model is ready for production | 2023-05-04T00:00:00 |
Leaving Comments
You can add comments to your runs directly through your repository hosting platform.
Code
With Cubyc's Python library, you can directly query your runs from your code and return the results as a Pandas DataFrame. This feature is particularly useful for integrating your queries into Python data analysis pipelines. Simply pass your SQL statement to the query function:
Remote Only
To query a remote repository, you can pass the URL of the remote repository directly to the query
function:
Command-Line
You can query your runs directly from your code by using the cubyc query
command.
You can also run local and remote queries directly from the command line. For example, to query the config
table from a remote repository, run:
The same options are available for command line queries as for Python queries. For more information on running queries from the command line, see Cubyc's commands page.
Remote Only
To query a remote repository, you can pass the URL of the remote repository directly to the query
command:
Branches
By default, Cubyc queries all of your runs across all branches.
However, you can easily query the experiment history of a specific branch,
To query a different branch or commit, simply pass the branch
parameter to the query
function.
Examples
To familiarize yourself with Cubyc's querying capabilities, here are some practical examples of queries you can run:
Example 1: Query the top 5 runs with the highest accuracy
SELECT
config.id,
MAX(CAST(logs.value AS FLOAT)) AS max_accuracy
FROM
config
INNER JOIN
logs ON config.id = logs.id
WHERE
logs.name = 'accuracy'
GROUP BY
config.id
ORDER BY
max_accuracy DESC
LIMIT 5
Example 1: Example 2: Query the average loss for each learning rate
SELECT
config.learning_rate,
AVG(CAST(logs.value AS FLOAT)) AS avg_loss
FROM
config
INNER JOIN
logs ON config.id = logs.id
WHERE
logs.name = 'loss'
GROUP BY
config.learning_rate