Earthquake Prediction
Kaggle Earthquake Time-to-Failure Prediction Challenge (Part 1)
Foreword
This is my first time taking part in a Kaggle challenge. I want to record what I learn along the way and share it with you, in the hope that it helps and that we can improve together.
Introduction
Accurately predicting earthquakes would play a very positive role in preventing deaths and damage. In this competition we try to use acoustic signal data to predict the remaining time until a laboratory-simulated earthquake occurs. The training set is one very large continuous recording (roughly 600-700 million rows), while in the competition we are given many separate short segments, and for each segment we must predict the time to failure.
In this post I will try to create more features and use more of the data for training.
(The code is written in Python 3, and quite a few packages need to be installed; I simply used Anaconda, which is also what my teacher recommended.)
About the data: it is provided by Kaggle, and the link is at the end.
A disclaimer up front: what I have done so far builds on other people's work, and a large part of what you read here is translated from English; the link to the original kernel is also at the end.
Step 1: import the libraries
import numpy as np
import pandas as pd
import os
import matplotlib.pyplot as plt
%matplotlib inline
from tqdm import tqdm_notebook
from sklearn.preprocessing import StandardScaler
from sklearn.svm import NuSVR, SVR
from sklearn.metrics import mean_absolute_error
pd.options.display.precision = 15
import lightgbm as lgb
import xgboost as xgb
import time
import datetime
from catboost import CatBoostRegressor
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import StratifiedKFold, KFold, RepeatedKFold
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
import gc
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
from scipy.signal import hilbert
from scipy.signal import hann
from scipy.signal import convolve
from scipy import stats
from sklearn.kernel_ridge import KernelRidge
I will not go into much detail here: scipy, sklearn, numpy and Anaconda should be familiar to everyone. It is worth noting that the original author also uses the lightgbm (lgb) and xgboost (xgb) libraries, and at the end blends several models for the final prediction.
Reading the data and taking a rough first look at it
(The data file is quite large; because of my laptop's limited performance, reading it took a long time.)
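As a side note: if loading is slow, one option (my own suggestion, not part of the original kernel, and it assumes pyarrow or fastparquet is installed) is to cache the CSV in a binary format after the first read so later runs load much faster:

import numpy as np
import pandas as pd

# Hypothetical helper, not from the original kernel: read train.csv once,
# then cache it as a Parquet file for faster loading on subsequent runs.
def load_train(csv_path='../input/train.csv', cache_path='train.parquet'):
    try:
        return pd.read_parquet(cache_path)          # fast path on later runs
    except (FileNotFoundError, OSError):
        df = pd.read_csv(csv_path,
                         dtype={'acoustic_data': np.int16,
                                'time_to_failure': np.float32})
        df.to_parquet(cache_path)                   # cache for next time
        return df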
%%time
train = pd.read_csv('../input/train.csv', dtype={'acoustic_data': np.int16, 'time_to_failure': np.float32})
train_acoustic_data_small = train['acoustic_data'].values[::50]
train_time_to_failure_small = train['time_to_failure'].values[::50]
fig, ax1 = plt.subplots(figsize=(16, 8))
plt.title("Trends of acoustic_data and time_to_failure. 2% of data (sampled)")
plt.plot(train_acoustic_data_small, color='b')
ax1.set_ylabel('acoustic_data', color='b')
plt.legend(['acoustic_data'])
ax2 = ax1.twinx()
plt.plot(train_time_to_failure_small, color='g')
ax2.set_ylabel('time_to_failure', color='g')
plt.legend(['time_to_failure'], loc=(0.875, 0.9))
plt.grid(False)
del train_acoustic_data_small
del train_time_to_failure_small
We can see that the acoustic data shows very large oscillations right before each quake, and that the signal behaves roughly periodically.
Another important observation: failures seem to share a pattern in which a huge burst in the signal occurs right after a stretch of relatively small values. This could be very useful for predicting the point where "time_to_failure" starts increasing again from 0.
I thought that comparing the maximum of consecutive signal chunks against some threshold (1000 or 2000, say) would be useful, but it did not work.
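For what it is worth, here is a minimal sketch of that thresholding idea (my own illustration, not code from the original kernel; the chunk size and threshold are arbitrary):

import numpy as np

def flag_large_chunks(signal, chunk_size=150_000, threshold=2000):
    # Flag chunks whose peak absolute amplitude exceeds a chosen threshold.
    # This only illustrates the idea mentioned above; in practice it did not help.
    n_chunks = len(signal) // chunk_size
    chunk_max = np.array([np.abs(signal[i * chunk_size:(i + 1) * chunk_size]).max()
                          for i in range(n_chunks)])
    return chunk_max > threshold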
Feature generation:
I create several groups of features:
Usual ones: mean, standard deviation, minimum and maximum.
Average difference between consecutive values, in absolute and percent terms. (The original text says "average difference between the consequitive values in absolute and percent values"; "consequitive" is presumably a typo for "consecutive".)
Minimum and maximum of the absolute values.
Aggregations over the first and last 10,000 and 50,000 values; I personally think these are useful (the original author showed in another kernel that the beginning and end of each segment are comparatively informative).
Max-to-min ratio and their difference, plus the count of values larger than 500 (an arbitrary threshold).
Features borrowed from other people's kernels:
Quantile features: https://www.kaggle.com/andrekos/basic-feature-benchmark-with-quantiles
Trend features: https://www.kaggle.com/jsaguiar/baseline-with-abs-and-trend-features
Rolling-quantile features: https://www.kaggle.com/wimwim/rolling-quantiles
#Create a training file with simple derived features
rows = 150_000
segments = int(np.floor(train.shape[0] / rows))
def add_trend_feature(arr, abs_values=False):
    idx = np.array(range(len(arr)))
    if abs_values:
        arr = np.abs(arr)
    lr = LinearRegression()
    lr.fit(idx.reshape(-1, 1), arr)
    return lr.coef_[0]
def classic_sta_lta(x, length_sta, length_lta):
    sta = np.cumsum(x ** 2)
    # Convert to float
    sta = np.require(sta, dtype=np.float64)
    # Copy for LTA
    lta = sta.copy()
    # Compute the STA and the LTA
    sta[length_sta:] = sta[length_sta:] - sta[:-length_sta]
    sta /= length_sta
    lta[length_lta:] = lta[length_lta:] - lta[:-length_lta]
    lta /= length_lta
    # Pad zeros
    sta[:length_lta - 1] = 0
    # Avoid division by zero by setting zero values to tiny float
    dtiny = np.finfo(0.0).tiny
    idx = lta < dtiny
    lta[idx] = dtiny
    return sta / lta
X_tr = pd.DataFrame(index=range(segments), dtype=np.float64)
y_tr = pd.DataFrame(index=range(segments), dtype=np.float64, columns=['time_to_failure'])
total_mean = train['acoustic_data'].mean()
total_std = train['acoustic_data'].std()
total_max = train['acoustic_data'].max()
total_min = train['acoustic_data'].min()
total_sum = train['acoustic_data'].sum()
total_abs_sum = np.abs(train['acoustic_data']).sum()
for segment in tqdm_notebook(range(segments)):
    seg = train.iloc[segment*rows:segment*rows+rows]
    x = pd.Series(seg['acoustic_data'].values)
    y = seg['time_to_failure'].values[-1]
    y_tr.loc[segment, 'time_to_failure'] = y
    X_tr.loc[segment, 'mean'] = x.mean()
    X_tr.loc[segment, 'std'] = x.std()
    X_tr.loc[segment, 'max'] = x.max()
    X_tr.loc[segment, 'min'] = x.min()
    X_tr.loc[segment, 'mean_change_abs'] = np.mean(np.diff(x))
    X_tr.loc[segment, 'mean_change_rate'] = np.mean(np.nonzero((np.diff(x) / x[:-1]))[0])
    X_tr.loc[segment, 'abs_max'] = np.abs(x).max()
    X_tr.loc[segment, 'abs_min'] = np.abs(x).min()
    X_tr.loc[segment, 'std_first_50000'] = x[:50000].std()
    X_tr.loc[segment, 'std_last_50000'] = x[-50000:].std()
    X_tr.loc[segment, 'std_first_10000'] = x[:10000].std()
    X_tr.loc[segment, 'std_last_10000'] = x[-10000:].std()
    X_tr.loc[segment, 'avg_first_50000'] = x[:50000].mean()
    X_tr.loc[segment, 'avg_last_50000'] = x[-50000:].mean()
    X_tr.loc[segment, 'avg_first_10000'] = x[:10000].mean()
    X_tr.loc[segment, 'avg_last_10000'] = x[-10000:].mean()
    X_tr.loc[segment, 'min_first_50000'] = x[:50000].min()
    X_tr.loc[segment, 'min_last_50000'] = x[-50000:].min()
    X_tr.loc[segment, 'min_first_10000'] = x[:10000].min()
    X_tr.loc[segment, 'min_last_10000'] = x[-10000:].min()
    X_tr.loc[segment, 'max_first_50000'] = x[:50000].max()
    X_tr.loc[segment, 'max_last_50000'] = x[-50000:].max()
    X_tr.loc[segment, 'max_first_10000'] = x[:10000].max()
    X_tr.loc[segment, 'max_last_10000'] = x[-10000:].max()
    X_tr.loc[segment, 'max_to_min'] = x.max() / np.abs(x.min())
    X_tr.loc[segment, 'max_to_min_diff'] = x.max() - np.abs(x.min())
    X_tr.loc[segment, 'count_big'] = len(x[np.abs(x) > 500])
    X_tr.loc[segment, 'sum'] = x.sum()
    X_tr.loc[segment, 'mean_change_rate_first_50000'] = np.mean(np.nonzero((np.diff(x[:50000]) / x[:50000][:-1]))[0])
    X_tr.loc[segment, 'mean_change_rate_last_50000'] = np.mean(np.nonzero((np.diff(x[-50000:]) / x[-50000:][:-1]))[0])
    X_tr.loc[segment, 'mean_change_rate_first_10000'] = np.mean(np.nonzero((np.diff(x[:10000]) / x[:10000][:-1]))[0])
    X_tr.loc[segment, 'mean_change_rate_last_10000'] = np.mean(np.nonzero((np.diff(x[-10000:]) / x[-10000:][:-1]))[0])
    X_tr.loc[segment, 'q95'] = np.quantile(x, 0.95)
    X_tr.loc[segment, 'q99'] = np.quantile(x, 0.99)
    X_tr.loc[segment, 'q05'] = np.quantile(x, 0.05)
    X_tr.loc[segment, 'q01'] = np.quantile(x, 0.01)
    X_tr.loc[segment, 'abs_q95'] = np.quantile(np.abs(x), 0.95)
    X_tr.loc[segment, 'abs_q99'] = np.quantile(np.abs(x), 0.99)
    X_tr.loc[segment, 'abs_q05'] = np.quantile(np.abs(x), 0.05)
    X_tr.loc[segment, 'abs_q01'] = np.quantile(np.abs(x), 0.01)
    X_tr.loc[segment, 'trend'] = add_trend_feature(x)
    X_tr.loc[segment, 'abs_trend'] = add_trend_feature(x, abs_values=True)
    X_tr.loc[segment, 'abs_mean'] = np.abs(x).mean()
    X_tr.loc[segment, 'abs_std'] = np.abs(x).std()
    X_tr.loc[segment, 'mad'] = x.mad()
    X_tr.loc[segment, 'kurt'] = x.kurtosis()
    X_tr.loc[segment, 'skew'] = x.skew()
    X_tr.loc[segment, 'med'] = x.median()
    X_tr.loc[segment, 'Hilbert_mean'] = np.abs(hilbert(x)).mean()
    X_tr.loc[segment, 'Hann_window_mean'] = (convolve(x, hann(150), mode='same') / sum(hann(150))).mean()
    X_tr.loc[segment, 'classic_sta_lta1_mean'] = classic_sta_lta(x, 500, 10000).mean()
    X_tr.loc[segment, 'classic_sta_lta2_mean'] = classic_sta_lta(x, 5000, 100000).mean()
    X_tr.loc[segment, 'classic_sta_lta3_mean'] = classic_sta_lta(x, 3333, 6666).mean()
    X_tr.loc[segment, 'classic_sta_lta4_mean'] = classic_sta_lta(x, 10000, 25000).mean()
    X_tr.loc[segment, 'Moving_average_700_mean'] = x.rolling(window=700).mean().mean(skipna=True)
    X_tr.loc[segment, 'Moving_average_1500_mean'] = x.rolling(window=1500).mean().mean(skipna=True)
    X_tr.loc[segment, 'Moving_average_3000_mean'] = x.rolling(window=3000).mean().mean(skipna=True)
    X_tr.loc[segment, 'Moving_average_6000_mean'] = x.rolling(window=6000).mean().mean(skipna=True)
    ewma = pd.Series.ewm
    X_tr.loc[segment, 'exp_Moving_average_300_mean'] = (ewma(x, span=300).mean()).mean(skipna=True)
    X_tr.loc[segment, 'exp_Moving_average_3000_mean'] = ewma(x, span=3000).mean().mean(skipna=True)
    X_tr.loc[segment, 'exp_Moving_average_30000_mean'] = ewma(x, span=6000).mean().mean(skipna=True)
    no_of_std = 2
    X_tr.loc[segment, 'MA_700MA_std_mean'] = x.rolling(window=700).std().mean()
    X_tr.loc[segment, 'MA_700MA_BB_high_mean'] = (X_tr.loc[segment, 'Moving_average_700_mean'] + no_of_std * X_tr.loc[segment, 'MA_700MA_std_mean']).mean()
    X_tr.loc[segment, 'MA_700MA_BB_low_mean'] = (X_tr.loc[segment, 'Moving_average_700_mean'] - no_of_std * X_tr.loc[segment, 'MA_700MA_std_mean']).mean()
    X_tr.loc[segment, 'MA_400MA_std_mean'] = x.rolling(window=400).std().mean()
    X_tr.loc[segment, 'MA_400MA_BB_high_mean'] = (X_tr.loc[segment, 'Moving_average_700_mean'] + no_of_std * X_tr.loc[segment, 'MA_400MA_std_mean']).mean()
    X_tr.loc[segment, 'MA_400MA_BB_low_mean'] = (X_tr.loc[segment, 'Moving_average_700_mean'] - no_of_std * X_tr.loc[segment, 'MA_400MA_std_mean']).mean()
    X_tr.loc[segment, 'MA_1000MA_std_mean'] = x.rolling(window=1000).std().mean()
    X_tr.loc[segment, 'iqr'] = np.subtract(*np.percentile(x, [75, 25]))
    X_tr.loc[segment, 'q999'] = np.quantile(x,0.999)
    X_tr.loc[segment, 'q001'] = np.quantile(x,0.001)
    X_tr.loc[segment, 'ave10'] = stats.trim_mean(x, 0.1)
    for windows in [10, 100, 1000]:
        x_roll_std = x.rolling(windows).std().dropna().values
        x_roll_mean = x.rolling(windows).mean().dropna().values
        X_tr.loc[segment, 'ave_roll_std_' + str(windows)] = x_roll_std.mean()
        X_tr.loc[segment, 'std_roll_std_' + str(windows)] = x_roll_std.std()
        X_tr.loc[segment, 'max_roll_std_' + str(windows)] = x_roll_std.max()
        X_tr.loc[segment, 'min_roll_std_' + str(windows)] = x_roll_std.min()
        X_tr.loc[segment, 'q01_roll_std_' + str(windows)] = np.quantile(x_roll_std, 0.01)
        X_tr.loc[segment, 'q05_roll_std_' + str(windows)] = np.quantile(x_roll_std, 0.05)
        X_tr.loc[segment, 'q95_roll_std_' + str(windows)] = np.quantile(x_roll_std, 0.95)
        X_tr.loc[segment, 'q99_roll_std_' + str(windows)] = np.quantile(x_roll_std, 0.99)
        X_tr.loc[segment, 'av_change_abs_roll_std_' + str(windows)] = np.mean(np.diff(x_roll_std))
        X_tr.loc[segment, 'av_change_rate_roll_std_' + str(windows)] = np.mean(np.nonzero((np.diff(x_roll_std) / x_roll_std[:-1]))[0])
        X_tr.loc[segment, 'abs_max_roll_std_' + str(windows)] = np.abs(x_roll_std).max()
        X_tr.loc[segment, 'ave_roll_mean_' + str(windows)] = x_roll_mean.mean()
        X_tr.loc[segment, 'std_roll_mean_' + str(windows)] = x_roll_mean.std()
        X_tr.loc[segment, 'max_roll_mean_' + str(windows)] = x_roll_mean.max()
        X_tr.loc[segment, 'min_roll_mean_' + str(windows)] = x_roll_mean.min()
        X_tr.loc[segment, 'q01_roll_mean_' + str(windows)] = np.quantile(x_roll_mean, 0.01)
        X_tr.loc[segment, 'q05_roll_mean_' + str(windows)] = np.quantile(x_roll_mean, 0.05)
        X_tr.loc[segment, 'q95_roll_mean_' + str(windows)] = np.quantile(x_roll_mean, 0.95)
        X_tr.loc[segment, 'q99_roll_mean_' + str(windows)] = np.quantile(x_roll_mean, 0.99)
        X_tr.loc[segment, 'av_change_abs_roll_mean_' + str(windows)] = np.mean(np.diff(x_roll_mean))
        X_tr.loc[segment, 'av_change_rate_roll_mean_' + str(windows)] = np.mean(np.nonzero((np.diff(x_roll_mean) / x_roll_mean[:-1]))[0])
        X_tr.loc[segment, 'abs_max_roll_mean_' + str(windows)] = np.abs(x_roll_mean).max()
The code is quite long and it took me a while to work through it, so let me explain it piece by piece.
rows = 150_000
segments = int(np.floor(train.shape[0] / rows))
This splits the training signal into non-overlapping segments of 150,000 rows each (the same length as a test segment), giving roughly 4,194 segments in total.
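The arithmetic works out roughly as follows (the row count below is the commonly cited size of train.csv and is an assumption on my part; the exact value may differ slightly):

rows = 150_000
n_train_rows = 629_145_480          # approximate number of rows in train.csv
segments = n_train_rows // rows     # -> 4194 non-overlapping segments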
def add_trend_feature(arr, abs_values=False):
    idx = np.array(range(len(arr)))
    if abs_values:
        arr = np.abs(arr)
    lr = LinearRegression()
    lr.fit(idx.reshape(-1, 1), arr)
    return lr.coef_[0]
This builds an index sequence 0, 1, 2, 3, ... (the same length as the input), fits a linear regression of the input values against that index, and returns the slope coefficient, i.e. the overall trend of the signal (optionally of its absolute values).
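A quick sanity check (my own example, not from the original kernel): for a perfectly linear signal the returned value is simply the increment per step.

import numpy as np

demo = np.arange(0, 100, 2)                        # 0, 2, 4, ... a perfectly linear "signal"
print(add_trend_feature(demo))                     # ~2.0: the fitted slope per index step
print(add_trend_feature(-demo, abs_values=True))   # ~2.0 again, since abs() flips the sign back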
def classic_sta_lta(x, length_sta, length_lta):
    sta = np.cumsum(x ** 2)
    # Convert to float
    sta = np.require(sta, dtype=np.float64)
    # Copy for LTA
    lta = sta.copy()
    # Compute the STA and the LTA
    sta[length_sta:] = sta[length_sta:] - sta[:-length_sta]
    sta /= length_sta
    lta[length_lta:] = lta[length_lta:] - lta[:-length_lta]
    lta /= length_lta
    # Pad zeros
    sta[:length_lta - 1] = 0
    # Avoid division by zero by setting zero values to tiny float
    dtiny = np.finfo(0.0).tiny
    idx = lta < dtiny
    lta[idx] = dtiny
    return sta / lta
This computes the ratio of the short-term average (STA) to the long-term average (LTA) of the signal energy; the STA/LTA ratio is a classic trigger used in geophysics for detecting seismic events.
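To get a feel for it, here is a small synthetic example (my own illustration, with arbitrary window lengths): a short burst in the middle of an otherwise quiet signal pushes the STA/LTA ratio well above 1 around the burst.

import numpy as np

rng = np.random.RandomState(0)
signal = rng.normal(0, 1, size=5000)       # background noise
signal[2500:2600] += 20                    # a short, strong burst (the "event")
ratio = classic_sta_lta(signal, length_sta=50, length_lta=1000)
print(ratio[1500:2500].mean())             # roughly 1 before the burst
print(ratio[2500:2700].max())              # much larger than 1 during the burst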
X_tr = pd.DataFrame(index=range(segments), dtype=np.float64)
y_tr = pd.DataFrame(index=range(segments), dtype=np.float64, columns=['time_to_failure'])
These lines create the X_tr (features) and y_tr (target) DataFrames. Each has one row per segment, i.e. about 4,194 rows, which is 1/150,000th of the number of raw rows.
The long block that follows then fills in 138 feature values per segment, covering the statistical descriptors commonly used to summarize a distribution. I will not go through them one by one here; look them up if you want the details.
Building the dataset
About the samples
I tried taking several thousand random 150,000-row windows, computing features on them, and adding them to the training set, but this lowered my score.
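For reference, a minimal sketch of that sampling idea (my own reconstruction, under the assumption that the windows are drawn uniformly at random; the original code may differ):

import numpy as np

def random_segment_starts(n_rows, rows=150_000, n_samples=2000, seed=42):
    # Pick random start indices for extra 150,000-row training windows.
    rng = np.random.RandomState(seed)
    return rng.randint(0, n_rows - rows, size=n_samples)

# Each extra window would then be featurized exactly like the fixed segments above,
# e.g. seg = train.iloc[start:start + rows]; in the author's experiments this hurt the score.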
np.abs(X_tr.corrwith(y_tr['time_to_failure'])).sort_values(ascending=False).head(12)
This computes the correlation of each feature with the target and lists the twelve most correlated ones.
Most of the top features turn out to be rolling-window statistics, mainly their low (5%) and high (95%) quantiles, which suggests these are the more reliable signals for predicting the remaining time to failure. The rolling functions are sliding-window operations over the series; for background (in Chinese) see https://blog.csdn.net/wj1066/article/details/78853717.
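As a tiny illustration of what these rolling features measure (my own example, not from the kernel):

import numpy as np
import pandas as pd

s = pd.Series([1, 2, 4, 8, 16, 32])
roll_std = s.rolling(3).std().dropna()       # std over each 3-value window
print(roll_std.values)                       # approximately [1.53, 3.06, 6.11, 12.22]
print(np.quantile(roll_std.values, 0.95))    # a 'q95_roll_std'-style feature for window 3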
scaler = StandardScaler()
scaler.fit(X_tr)
X_train_scaled = pd.DataFrame(scaler.transform(X_tr), columns=X_tr.columns)
Next the features are standardized with StandardScaler (zero mean, unit variance), which matters a lot for the models built later.
Then the test segments are read and the same features are computed for each of them.
submission = pd.read_csv('../input/sample_submission.csv', index_col='seg_id')
X_test = pd.DataFrame(columns=X_tr.columns, dtype=np.float64, index=submission.index)
plt.figure(figsize=(22, 16))
for i, seg_id in enumerate(tqdm_notebook(X_test.index)):
    seg = pd.read_csv('../input/test/' + seg_id + '.csv')
    x = pd.Series(seg['acoustic_data'].values)
    X_test.loc[seg_id, 'mean'] = x.mean()
    X_test.loc[seg_id, 'std'] = x.std()
    X_test.loc[seg_id, 'max'] = x.max()
    X_test.loc[seg_id, 'min'] = x.min()
    X_test.loc[seg_id, 'mean_change_abs'] = np.mean(np.diff(x))
    X_test.loc[seg_id, 'mean_change_rate'] = np.mean(np.nonzero((np.diff(x) / x[:-1]))[0])
    X_test.loc[seg_id, 'abs_max'] = np.abs(x).max()
    X_test.loc[seg_id, 'abs_min'] = np.abs(x).min()
    X_test.loc[seg_id, 'std_first_50000'] = x[:50000].std()
    X_test.loc[seg_id, 'std_last_50000'] = x[-50000:].std()
    X_test.loc[seg_id, 'std_first_10000'] = x[:10000].std()
    X_test.loc[seg_id, 'std_last_10000'] = x[-10000:].std()
    X_test.loc[seg_id, 'avg_first_50000'] = x[:50000].mean()
    X_test.loc[seg_id, 'avg_last_50000'] = x[-50000:].mean()
    X_test.loc[seg_id, 'avg_first_10000'] = x[:10000].mean()
    X_test.loc[seg_id, 'avg_last_10000'] = x[-10000:].mean()
    X_test.loc[seg_id, 'min_first_50000'] = x[:50000].min()
    X_test.loc[seg_id, 'min_last_50000'] = x[-50000:].min()
    X_test.loc[seg_id, 'min_first_10000'] = x[:10000].min()
    X_test.loc[seg_id, 'min_last_10000'] = x[-10000:].min()
    X_test.loc[seg_id, 'max_first_50000'] = x[:50000].max()
    X_test.loc[seg_id, 'max_last_50000'] = x[-50000:].max()
    X_test.loc[seg_id, 'max_first_10000'] = x[:10000].max()
    X_test.loc[seg_id, 'max_last_10000'] = x[-10000:].max()
    X_test.loc[seg_id, 'max_to_min'] = x.max() / np.abs(x.min())
    X_test.loc[seg_id, 'max_to_min_diff'] = x.max() - np.abs(x.min())
    X_test.loc[seg_id, 'count_big'] = len(x[np.abs(x) > 500])
    X_test.loc[seg_id, 'sum'] = x.sum()
    X_test.loc[seg_id, 'mean_change_rate_first_50000'] = np.mean(np.nonzero((np.diff(x[:50000]) / x[:50000][:-1]))[0])
    X_test.loc[seg_id, 'mean_change_rate_last_50000'] = np.mean(np.nonzero((np.diff(x[-50000:]) / x[-50000:][:-1]))[0])
    X_test.loc[seg_id, 'mean_change_rate_first_10000'] = np.mean(np.nonzero((np.diff(x[:10000]) / x[:10000][:-1]))[0])
    X_test.loc[seg_id, 'mean_change_rate_last_10000'] = np.mean(np.nonzero((np.diff(x[-10000:]) / x[-10000:][:-1]))[0])
    X_test.loc[seg_id, 'q95'] = np.quantile(x,0.95)
    X_test.loc[seg_id, 'q99'] = np.quantile(x,0.99)
    X_test.loc[seg_id, 'q05'] = np.quantile(x,0.05)
    X_test.loc[seg_id, 'q01'] = np.quantile(x,0.01)
    X_test.loc[seg_id, 'abs_q95'] = np.quantile(np.abs(x), 0.95)
    X_test.loc[seg_id, 'abs_q99'] = np.quantile(np.abs(x), 0.99)
    X_test.loc[seg_id, 'abs_q05'] = np.quantile(np.abs(x), 0.05)
    X_test.loc[seg_id, 'abs_q01'] = np.quantile(np.abs(x), 0.01)
    X_test.loc[seg_id, 'trend'] = add_trend_feature(x)
    X_test.loc[seg_id, 'abs_trend'] = add_trend_feature(x, abs_values=True)
    X_test.loc[seg_id, 'abs_mean'] = np.abs(x).mean()
    X_test.loc[seg_id, 'abs_std'] = np.abs(x).std()
    X_test.loc[seg_id, 'mad'] = x.mad()
    X_test.loc[seg_id, 'kurt'] = x.kurtosis()
    X_test.loc[seg_id, 'skew'] = x.skew()
    X_test.loc[seg_id, 'med'] = x.median()
    X_test.loc[seg_id, 'Hilbert_mean'] = np.abs(hilbert(x)).mean()
    X_test.loc[seg_id, 'Hann_window_mean'] = (convolve(x, hann(150), mode='same') / sum(hann(150))).mean()
    X_test.loc[seg_id, 'classic_sta_lta1_mean'] = classic_sta_lta(x, 500, 10000).mean()
    X_test.loc[seg_id, 'classic_sta_lta2_mean'] = classic_sta_lta(x, 5000, 100000).mean()
    X_test.loc[seg_id, 'classic_sta_lta3_mean'] = classic_sta_lta(x, 3333, 6666).mean()
    X_test.loc[seg_id, 'classic_sta_lta4_mean'] = classic_sta_lta(x, 10000, 25000).mean()
    X_test.loc[seg_id, 'Moving_average_700_mean'] = x.rolling(window=700).mean().mean(skipna=True)
    X_test.loc[seg_id, 'Moving_average_1500_mean'] = x.rolling(window=1500).mean().mean(skipna=True)
    X_test.loc[seg_id, 'Moving_average_3000_mean'] = x.rolling(window=3000).mean().mean(skipna=True)
    X_test.loc[seg_id, 'Moving_average_6000_mean'] = x.rolling(window=6000).mean().mean(skipna=True)
    ewma = pd.Series.ewm
    X_test.loc[seg_id, 'exp_Moving_average_300_mean'] = (ewma(x, span=300).mean()).mean(skipna=True)
    X_test.loc[seg_id, 'exp_Moving_average_3000_mean'] = ewma(x, span=3000).mean().mean(skipna=True)
    X_test.loc[seg_id, 'exp_Moving_average_30000_mean'] = ewma(x, span=6000).mean().mean(skipna=True)
    no_of_std = 2
    X_test.loc[seg_id, 'MA_700MA_std_mean'] = x.rolling(window=700).std().mean()
    X_test.loc[seg_id, 'MA_700MA_BB_high_mean'] = (X_test.loc[seg_id, 'Moving_average_700_mean'] + no_of_std * X_test.loc[seg_id, 'MA_700MA_std_mean']).mean()
    X_test.loc[seg_id, 'MA_700MA_BB_low_mean'] = (X_test.loc[seg_id, 'Moving_average_700_mean'] - no_of_std * X_test.loc[seg_id, 'MA_700MA_std_mean']).mean()
    X_test.loc[seg_id, 'MA_400MA_std_mean'] = x.rolling(window=400).std().mean()
    X_test.loc[seg_id, 'MA_400MA_BB_high_mean'] = (X_test.loc[seg_id, 'Moving_average_700_mean'] + no_of_std * X_test.loc[seg_id, 'MA_400MA_std_mean']).mean()
    X_test.loc[seg_id, 'MA_400MA_BB_low_mean'] = (X_test.loc[seg_id, 'Moving_average_700_mean'] - no_of_std * X_test.loc[seg_id, 'MA_400MA_std_mean']).mean()
    X_test.loc[seg_id, 'MA_1000MA_std_mean'] = x.rolling(window=1000).std().mean()
    X_test.loc[seg_id, 'iqr'] = np.subtract(*np.percentile(x, [75, 25]))
    X_test.loc[seg_id, 'q999'] = np.quantile(x,0.999)
    X_test.loc[seg_id, 'q001'] = np.quantile(x,0.001)
    X_test.loc[seg_id, 'ave10'] = stats.trim_mean(x, 0.1)
    for windows in [10, 100, 1000]:
        x_roll_std = x.rolling(windows).std().dropna().values
        x_roll_mean = x.rolling(windows).mean().dropna().values
        X_test.loc[seg_id, 'ave_roll_std_' + str(windows)] = x_roll_std.mean()
        X_test.loc[seg_id, 'std_roll_std_' + str(windows)] = x_roll_std.std()
        X_test.loc[seg_id, 'max_roll_std_' + str(windows)] = x_roll_std.max()
        X_test.loc[seg_id, 'min_roll_std_' + str(windows)] = x_roll_std.min()
        X_test.loc[seg_id, 'q01_roll_std_' + str(windows)] = np.quantile(x_roll_std, 0.01)
        X_test.loc[seg_id, 'q05_roll_std_' + str(windows)] = np.quantile(x_roll_std, 0.05)
        X_test.loc[seg_id, 'q95_roll_std_' + str(windows)] = np.quantile(x_roll_std, 0.95)
        X_test.loc[seg_id, 'q99_roll_std_' + str(windows)] = np.quantile(x_roll_std, 0.99)
        X_test.loc[seg_id, 'av_change_abs_roll_std_' + str(windows)] = np.mean(np.diff(x_roll_std))
        X_test.loc[seg_id, 'av_change_rate_roll_std_' + str(windows)] = np.mean(np.nonzero((np.diff(x_roll_std) / x_roll_std[:-1]))[0])
        X_test.loc[seg_id, 'abs_max_roll_std_' + str(windows)] = np.abs(x_roll_std).max()
        X_test.loc[seg_id, 'ave_roll_mean_' + str(windows)] = x_roll_mean.mean()
        X_test.loc[seg_id, 'std_roll_mean_' + str(windows)] = x_roll_mean.std()
        X_test.loc[seg_id, 'max_roll_mean_' + str(windows)] = x_roll_mean.max()
        X_test.loc[seg_id, 'min_roll_mean_' + str(windows)] = x_roll_mean.min()
        X_test.loc[seg_id, 'q01_roll_mean_' + str(windows)] = np.quantile(x_roll_mean, 0.01)
        X_test.loc[seg_id, 'q05_roll_mean_' + str(windows)] = np.quantile(x_roll_mean, 0.05)
        X_test.loc[seg_id, 'q95_roll_mean_' + str(windows)] = np.quantile(x_roll_mean, 0.95)
        X_test.loc[seg_id, 'q99_roll_mean_' + str(windows)] = np.quantile(x_roll_mean, 0.99)
        X_test.loc[seg_id, 'av_change_abs_roll_mean_' + str(windows)] = np.mean(np.diff(x_roll_mean))
        X_test.loc[seg_id, 'av_change_rate_roll_mean_' + str(windows)] = np.mean(np.nonzero((np.diff(x_roll_mean) / x_roll_mean[:-1]))[0])
        X_test.loc[seg_id, 'abs_max_roll_mean_' + str(windows)] = np.abs(x_roll_mean).max()
    if i < 12:
        plt.subplot(6, 4, i + 1)
        plt.plot(seg['acoustic_data'])
        plt.title(seg_id)
X_test_scaled = pd.DataFrame(scaler.transform(X_test), columns=X_test.columns)
Now the features for all 2,624 test segments are ready.
Next we set up the model.
n_fold = 5
folds = KFold(n_splits=n_fold, shuffle=True, random_state=11)
Here the KFold function defines five cross-validation folds (shuffled, with a fixed random seed so the split is reproducible).
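A minimal sketch of how these folds are typically consumed (model training itself is covered in the next part; the LinearRegression here is only a placeholder, not the model the original author uses):

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

oof = np.zeros(len(X_train_scaled))                      # out-of-fold predictions
for fold_n, (train_index, valid_index) in enumerate(folds.split(X_train_scaled)):
    X_t, X_v = X_train_scaled.iloc[train_index], X_train_scaled.iloc[valid_index]
    y_t, y_v = y_tr.iloc[train_index], y_tr.iloc[valid_index]
    model = LinearRegression().fit(X_t, y_t)             # placeholder model
    oof[valid_index] = model.predict(X_v).reshape(-1)
print('CV MAE:', mean_absolute_error(y_tr, oof))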
The models are then trained on the 138 features set up above, i.e. the statistical descriptors of each segment.
At this point the original author has essentially finished building the dataset.
In the next part I will take a closer look at the original author's training models.
Finally, I am still a beginner who is simply interested in this topic and wants to keep a record of it; corrections are very welcome.
Original kernel: https://www.kaggle.com/artgor/earthquakes-fe-more-features-and-samples