How I Got In The Top 1% of A Kaggle Competition With kxy And No Hyper-Parameter Tuning

The solution that ranked 26th/1946 in the G-Research Crypto Forecasting Kaggle competition.


The G-Research Crypto Forecasting Kaggle competition was my first Kaggle competition using the kxy package, and I managed to finish 26th out of 1946 teams with LightGBM, no hyper-parameter tuning, and only 2 submissions (one test and one real)!

In this post I share my solution and explain why the kxy package was key.

The Competition

The aim of the competition was to predict the price moves (net of market trends) of several cryptocurrencies over the next 15 minutes, using historical trade data aggregated into minute bars up to the current minute.

The Data

Specifically, each row of the training data CSV file summarizes trades in a cryptocurrency that took place in a given minute.

Here's the full list of attributes of each row:

  • timestamp - A timestamp for the minute covered by the row.
  • Asset_ID - An ID code for the cryptocurrency.
  • Count - The number of trades that took place this minute.
  • Open - The USD price at the beginning of the minute.
  • High - The highest USD price during the minute.
  • Low - The lowest USD price during the minute.
  • Close - The USD price at the end of the minute.
  • Volume - The number of cryptocurrency units traded during the minute.
  • VWAP - The volume weighted average price for the minute.
  • Target - The residualized log-return over the 15 minutes following the current minute. For details of how the target is calculated see here.

Additional information about each cryptocurrency includes the following; a short snippet for loading and joining the two files appears after the list.

  • Asset_ID - An ID code for the cryptocurrency.
  • Name - The real name of the cryptocurrency associated to Asset_ID.
  • Weight - The weight that the cryptocurrency associated to Asset_ID receives in the evaluation metric.
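For concreteness, here is a minimal loading sketch. The file names train.csv and asset_details.csv reflect the competition's data layout and are assumptions for this illustration, not part of the solution itself.

import pandas as pd

# Load the minute bars and the per-asset metadata, then attach Name and
# Weight to each row by joining on Asset_ID.
train_df = pd.read_csv('train.csv')
asset_details_df = pd.read_csv('asset_details.csv')
train_df = train_df.merge(asset_details_df, on='Asset_ID', how='left')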

The Evaluation Criteria

Teams were ranked based on the weighted average (across cryptocurrencies) of the Pearson correlation between their predictions of Target and the ground truth during the live testing phase. Pearson correlations were weighted using the Weight column above.
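As a rough illustration of this ranking metric, here is a sketch that follows the description above (not the competition's exact scoring code); the prediction column name is hypothetical.

import numpy as np

# `df` is assumed to contain Asset_ID, Weight, Target and a (hypothetical)
# prediction column; the score is the Weight-weighted average, across assets,
# of the per-asset Pearson correlation between prediction and Target.
def weighted_correlation_score(df):
    per_asset_corr = df.groupby('Asset_ID').apply(lambda g: g['prediction'].corr(g['Target']))
    asset_weights = df.groupby('Asset_ID')['Weight'].first()
    return np.average(per_asset_corr, weights=asset_weights)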

My Solution

My solution relied on three essential building blocks: feature construction, training data resampling, and effective feature selection.

Feature Construction

The goal of feature construction is to turn raw data (i.e. Open, High, Low, Close, Volume, and VWAP at a given time, as well as their past values) into representations that can potentially have much simpler relationships to the target; for instance, features that we reasonably believe can have a monotonic relationship to the target, its magnitude (i.e. absolute value), or its sign.

Atemporal Features

I began by constructing features from the latest minute bar that are more likely to reveal the direction and/or the magnitude of future market moves than raw attributes such as Open, High, Low, Close and VWAP; a pandas sketch of these computations follows the list.

  • UPS = High-max(Close, Open) - Upper shadow; i.e. how high did the price get in the minute bar relative to where it started and closed. High values of this feature could indicate short-term reversions, low values could indicate short-term momentum/trend.
  • LOS = min(Close, Open)-Low - Lower shadow; i.e. how low did the price get in the minute bar relative to where it started and closed. High values of this feature could indicate short-term reversions, low values could indicate short-term momentum/trend.
  • RNG = (High-Low)/VWAP - High-low range; i.e. how did the price fluctuate in the minute bar. This feature could be perceived as a volatility indicator. High (resp. low) values could indicate that the target might be high (resp. low) in absolute value.
  • MOV = (Close-Open)/VWAP - Price move in the minute bar. This feature could be perceived as a momentum/trend indicator.  High (resp. low) values could indicate that the target might be high (resp. low).
  • LOGRETCO = log(Close/Open) - Log-return Open-to-Close. This feature could be perceived as a momentum/trend indicator.  High (resp. low) values could indicate that the target might be high (resp. low).
  • RANKLOGRETCO = LOGRETCO.groupby('timestamp').rank('dense') - How the Open-to-Close log-return of the cryptocurrency the row pertains to, ranks relative to the Open-to-Close log-returns of other cryptocurrencies in the same minute bar.
  • CLS = (Close-VWAP)/VWAP - The last trade price in the bar relative to the average trade price. This could be perceived as a momentum/trend indicator. High (resp. low) values could indicate that prices are trending up (resp. down).
  • LOGVOL = log(1+Volume) - Logarithm of the trade volume. Same as Volume but more robust to outliers. High volume could be an indication that returns will persist in the near future. Low volume and high returns could indicate a future market correction.
  • LOGCNT = log(1+Count) - Logarithm of the number of trades in the bar. Same as Count but more robust to outliers. High count could be an indication that returns will persist in the near future. Low count and high returns could indicate a future market correction.
  • VOL2CNT = Volume/(1+Count) - Average trade size in the bar. Can reveal whether there is a big player trading a large block, which would move the price further.
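Here is a minimal pandas sketch of these computations, assuming df is the raw training DataFrame with the columns listed earlier.

import numpy as np

df['UPS'] = df['High'] - df[['Close', 'Open']].max(axis=1)  # Upper shadow
df['LOS'] = df[['Close', 'Open']].min(axis=1) - df['Low']   # Lower shadow
df['RNG'] = (df['High'] - df['Low']) / df['VWAP']           # High-low range
df['MOV'] = (df['Close'] - df['Open']) / df['VWAP']         # Price move in the bar
df['LOGRETCO'] = np.log(df['Close'] / df['Open'])           # Open-to-Close log-return
df['RANKLOGRETCO'] = df.groupby('timestamp')['LOGRETCO'].rank(method='dense')
df['CLS'] = (df['Close'] - df['VWAP']) / df['VWAP']         # Last price vs. average price
df['LOGVOL'] = np.log(1. + df['Volume'])                    # Log trade volume
df['LOGCNT'] = np.log(1. + df['Count'])                     # Log trade count
df['VOL2CNT'] = df['Volume'] / (1. + df['Count'])           # Average trade size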

Temporal Features

Next, for each of the features above, I constructed the following 4 features to summarize values of the feature over the past 15 minutes. The idea was to capture short-term trends and extrema, and to gauge the short-term stability of the feature.

While atemporal features can be indicative of future market moves, these temporal aggregations can add to their forecasting power by providing useful temporal context; a possible rolling implementation is sketched after the list.

  • *.GROUPBY(Asset_ID).LAST(15).MEAN() - Average of the feature over the past 15 minutes for a specific cryptocurrency.
  • *.GROUPBY(Asset_ID).LAST(15).MAX() - Maximum of the feature over the past 15 minutes for a specific cryptocurrency.
  • *.GROUPBY(Asset_ID).LAST(15).MIN() - Minimum of the feature over the past 15 minutes for a specific cryptocurrency.
  • *.GROUPBY(Asset_ID).LAST(15).MAX-MIN() - Difference between the maximum and the minimum of the feature over the past 15 minutes for a specific cryptocurrency.
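A minimal sketch of these rolling aggregations, assuming the rows are sorted by timestamp and atemporal_cols is a (hypothetical) list of the feature names above; the new column names simply mirror the naming convention used in this post.

# 15-bar rolling summaries, computed per cryptocurrency.
for col in atemporal_cols:
    grouped = df.groupby('Asset_ID')[col]
    roll_mean = grouped.transform(lambda x: x.rolling(15, min_periods=1).mean())
    roll_max = grouped.transform(lambda x: x.rolling(15, min_periods=1).max())
    roll_min = grouped.transform(lambda x: x.rolling(15, min_periods=1).min())
    df[f'{col}.GROUPBY(Asset_ID).LAST(15).MEAN()'] = roll_mean
    df[f'{col}.GROUPBY(Asset_ID).LAST(15).MAX()'] = roll_max
    df[f'{col}.GROUPBY(Asset_ID).LAST(15).MIN()'] = roll_min
    df[f'{col}.GROUPBY(Asset_ID).LAST(15).MAX-MIN()'] = roll_max - roll_min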

Seasonality Features

I used the following features to help detect possible seasonal patterns; a short snippet for computing them appears after the list.

  • timestamp.HOUR() - Hour in the day.
  • timestamp.DAYOFWEEK() - The day of the week with Monday=0, Sunday=6.
  • timestamp.DAY() - Day of the month.
  • timestamp.MONTH() - Month in the year.
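A possible implementation, assuming timestamp is a Unix timestamp in seconds as in the competition data:

import pandas as pd

# Derive calendar features from the minute-bar timestamp.
dt = pd.to_datetime(df['timestamp'], unit='s')
df['timestamp.HOUR()'] = dt.dt.hour
df['timestamp.DAYOFWEEK()'] = dt.dt.dayofweek  # Monday=0, Sunday=6
df['timestamp.DAY()'] = dt.dt.day
df['timestamp.MONTH()'] = dt.dt.month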

Asset-Specific Features

I chose to build a single model to trade all cryptocurrencies, as I expected the same market players to be active in all of them and, as a result, inefficiencies across cryptocurrencies to be of similar types. However, I added Asset_ID as a feature to give the model the ability to learn idiosyncratic patterns. This brings the total number of candidate features to 56.

Feature Selection and Model Training

Financial markets, both traditional and decentralized, tend to undergo frequent and aggressive data distribution drifts due to, among other factors, geopolitical and other macroeconomic events.

As such, during training, data that are a few months old should not be given the same importance as recently observed data.

Additionally, care should be taken to avoid using too many features or, equivalently, to incorporate effective feature selection in the modeling pipeline. The more features are used in a model, the more the model is exposed to data distribution drifts.

Training Data

To construct the training data, I first built all features from the full raw training data. I then proceeded one quarter at a time, in chronological order, starting from 2018Q1. The training data was initialized to the 2018Q1 feature data. For each new quarter from 2018Q2 to 2021Q2, I updated the training data by sampling 80% of the new quarter's data and 20% of the existing training data.

This autoregressive-style sampling scheme overweights recent observations while exponentially decaying the importance of past observations.
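Here is a minimal sketch of this resampling loop; features_df is assumed to be the full feature DataFrame with a DatetimeIndex covering 2018Q1 through 2021Q2, and the variable names are illustrative.

import pandas as pd

# Split the feature data into chronological quarters.
quarters = [g for _, g in features_df.groupby(features_df.index.to_period('Q'))]

# Initialize with 2018Q1, then for each subsequent quarter keep
# 80% of the new quarter and 20% of the current training data.
training_df = quarters[0]
for new_quarter in quarters[1:]:
    training_df = pd.concat([
        new_quarter.sample(frac=0.8, random_state=0),
        training_df.sample(frac=0.2, random_state=0),
    ])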

Testing Data

To test the method on a rolling basis, each new quarter can serve as testing data, with the training data as it stood before that quarter was resampled in serving as training data. Ultimately, I only constructed the training data through the end of 2021Q2 for training and validation, and used 2021Q3 as testing data.

Model Choice

I used LightGBM with default hyper-parameters.

from kxy.learning import get_lightgbm_learner_learning_api

# Only the objective is set; every other LightGBM hyper-parameter is left at its default.
params = {'objective': 'rmse'}
learner_func = get_lightgbm_learner_learning_api(params)

Feature Selection

I used the kxy package to wrap LeanML feature selection around LightGBM training.

import kxy

# Jointly train the model and select features using 
# 	LeanML feature selection.
target_column = 'Target'
results = training_features_df.kxy.fit(target_column, \
	learner_func, problem_type='regression')
    
# Retrieve the trained predictor
predictor = results['predictor']

# Make predictions
predictions = predictor.predict(testing_features_df)

# Selected features
selected_features = predictor.selected_variables

Out of the original 56 candidate features, here is the list of the 28 features selected by LeanML, in the order they were selected:

  • RANKLOGRETCO.GROUPBY(Asset_ID).LAST(15).MEAN()
  • LOGRETCO.GROUPBY(Asset_ID).LAST(15).MAX()
  • RNG.GROUPBY(Asset_ID).LAST(15).MIN()
  • LOGRETCO.GROUPBY(Asset_ID).LAST(15).MEAN()
  • Asset_ID
  • timestamp.HOUR()
  • timestamp.DAY()
  • VOL2CNT.GROUPBY(Asset_ID).LAST(15).MAX-MIN()
  • VOL2CNT
  • LOGVOL
  • LOGCNT.GROUPBY(Asset_ID).LAST(15).MIN()
  • VOL2CNT.GROUPBY(Asset_ID).LAST(15).MIN()
  • LOGVOL.GROUPBY(Asset_ID).LAST(15).MIN()
  • LOGVOL.GROUPBY(Asset_ID).LAST(15).MEAN()
  • LOGCNT.GROUPBY(Asset_ID).LAST(15).MAX()
  • LOS
  • UPS.GROUPBY(Asset_ID).LAST(15).MAX()
  • UPS.GROUPBY(Asset_ID).LAST(15).MAX-MIN()
  • LOS.GROUPBY(Asset_ID).LAST(15).MAX()
  • LOGCNT.GROUPBY(Asset_ID).LAST(15).MAX-MIN()
  • UPS.GROUPBY(Asset_ID).LAST(15).MIN()
  • VOL2CNT.GROUPBY(Asset_ID).LAST(15).MEAN()
  • LOGCNT.GROUPBY(Asset_ID).LAST(15).MEAN()
  • LOS.GROUPBY(Asset_ID).LAST(15).MAX-MIN()
  • CLS.GROUPBY(Asset_ID).LAST(15).MAX()
  • RNG.GROUPBY(Asset_ID).LAST(15).MAX-MIN()
  • CLS.GROUPBY(Asset_ID).LAST(15).MEAN()
  • MOV

Benefits of the kxy Package

While feature construction and the autoregressive-style resampling scheme contributed to this solution ranking in the top 1% with just one real submission, what truly made the difference was the LeanML feature selection in the kxy package.

Throughout the live testing phase of the competition, my submission oscillated between 92nd and 163rd place on the public leaderboard, yet I finished in 26th place, an indication that dozens of submissions ranked above mine were penalized by data distribution drifts.

This is a testament to the ability of kxy's LeanML feature selection to mitigate data distribution drifts by drastically shrinking the number of features used, while preserving forecasting power.

See this blog post for an extensive comparison between LeanML and other feature selection methods.