Godaddy（1/5）时间序列异常检测_序列_模子

2025-01-09 06:18:04 科技快讯

据我所知，韶光序列当中存在的非常值如下几种情形：

Additive -（AO）加性非常值不会持续存在。
顾名思义，加性非常值只影响非常值发生的韶光的不雅观测值。
因此，加性非常值对未来的预测没有影响，可以理解为是一种伶仃事宜。
Innovational -（IO）这种类型的非常值不仅对 X(T) 造成滋扰，还会影响韶光 T 之后序列中所有的不雅观测值。
可以理解为干预性非常值，一种脉冲性冲击。
Level shift -（LS）它们的浸染是在非常值发生的韶光点开始提高或降落系列的均值。
均值的变革是溘然且永久性的。
可以利用阶跃函数函数来建模，其幅度即是omega参数。
Transient -（TS）这种类型的非常值在时候 T 发生时会带有某种初始影响，然后根据衰减因子的大小，在一段韶光后呈指数衰减，终极归于常态。
趋势变动 - 这种是数据模式上发生的改变，可能没有溘然涌现的异动，但是整体构造发生了变动。
Additive

Godaddy（1/5）时间序列异常检测_序列_模子科技快讯

这是一种加性的非常，更像是一过性的变革，没有对周围数据产生影响，一样平常情形下，这种情形一样平常直接删除该段数据或者用周围的数据进行添补。

Innovational

这是一种类似冲击的非常，对数据产生颠簸也可能是趋势的影响。
图中在黄线之后数据颠簸幅度变大，均值也些许提升。
在业务场景中，更像是公司政策亦或者政府政策产生某种持续性的影响。

Level shift

这是一种类似跳跃的水平移位非常且是多次移位，韶光序列被分为4个不同均值水平的部分。
每个部分的开头对付前一个韶光段来说都是非常值。
这些非常值不像范例的非常点，而是当前值跳跃并坚持一段韶光的某个水平。
这种情形使得历史数据对将来预测的参考意义降落，由于该部分看起来更像来自另一个序列而不是它自己。
因此，模式学习变得困难。

Transient

这种变革非常类似双11，一种营销活动产生的短韶光影响，图中在韶光节点11开始突增，在17旁边逐步回落到常态。
很显然，如果从历史（这段数据韶光范围内）中找履历很难预测到这段变革。

GoDaddy非常检测

前面我们剖析了韶光序列非常值的类型，与可能存在的影响，接下来我们将利用Godaddy竞赛数据进行非常检测。

比赛的背景

比赛主理方GoDaddy是环球最大的企业家做事平台。
他们的义务是通过为环球超过2000万客户和所有企业家供应所需的所有帮助和工具，帮助他们在网络上发展。
这个竞赛的目标是预测一个给定地区每月的微型企业密度。
须要利用美国县级数据演习出一个准确的模型。

我们将从以下几个方面，进行非常检测。

基于差分的LOF - 这个是自定义的检测方法，能够一定程度上检测到LS，IO，AO，乃至TC。
缺陷是无法剖断详细非常的类型。
chow test - 能够进行趋势改变的检测，进行构造性变动的检测。
缺陷是须要自定义窗口进行检测，同时也无法检测出所有的构造性变动基于ARIMA的检测 - 优点是效率高，相对精准，但是存在利用门槛，须要进行其他的结合进行详细非常类型的剖断其他基于差分的LOF

以下检测方法是一种自定义的非常检测方法。
它利用LOF检测差分序列中的非常值并将其定位。
此方法可以检测各种非常，包括IO、AO、LS、TC等。
但是，定位可能存在一些不准确性，须要与其他方法结合利用以得到更精确的结果。
下面是我自定义函数CusOulierDet，输入是韶光序列的list，输出是，所有的断点checkpoint以及根据断点切割出来的片段segment

def CusOulierDet(ts: List[float], threshold: float = -3.5) -> Tuple[List[List[float]], List[int]]: """ This function detects outliers in a time series and splits it into multiple segments for further processing. :param ts: The time series data. :param threshold: The threshold for the outlier score. Default is -3.5. :return: A list of all segments and a list of the split points. """ # Calculate the differences between consecutive values in the time series differences = [ts[i+1] - ts[i] for i in range(len(ts)-1)] # Convert the differences into a two-dimensional array X = [[val] for val in differences] # Create the Local Outlier Factor (LOF) model and set hyperparameters and algorithm parameters lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05, novelty=True) # Train the model and get the outlier scores for each data point lof.fit(X) scores = lof.negative_outlier_factor_ # Identify the positions of the outliers anomalies = [] for i, score in enumerate(scores): if score < threshold: anomalies.append(i+1) # If there are no outliers, return the original time series and an empty list of split points if not anomalies: return [ts], [] # Split the original time series into segments based on the positions of the outliers and record the split points segments = [] split_points = [] start_idx = 0 for anomaly in anomalies: segments.append(ts[start_idx:anomaly]) split_points.append(anomaly) start_idx = anomaly # Add the final segment to the list of segments segments.append(ts[start_idx:]) # Return the list of segments and the list of split points return segments, split_points

从上面结果来看，切割变点特殊是LF，如序列cfip（县ID）17055，存在两次跳跃，13219存在一次跳跃。

chow test

Chow test是一种统计考验方法，用于考验一个线性回归模型中，是否存在构造性断点或者说是参数在断点处发生了变革。
在Chow test中，将数据集划分为两个或多个子样本，然后估计每个子样本中的回归模型，并打算总体回归模型与子样本回归模型之间的差异。
如果总体回归模型与子样本回归模型之间存在显著差异，则表明存在构造性断点。

Chow test常日用于韶光序列剖析、面板数据剖析等领域，可以用于考验某个韶光点或者某个事宜引起了模型参数的突变。
它的优点是易于实现，但缺陷是须要对数据进行分割，而且在断点位置不清楚时，可能会产生误判。
下面构建根本的chowtest函数，输入是韶光x，序列值y，以及选择检测点checkpoint，输出时检测显著性P值。

def chow_test(y, x, breakpoint): # Add a column for the intercept x = sm.add_constant(x) # Fit a linear regression model using OLS model_full = sm.OLS(y, x) results_full = model_full.fit() # Split the data into two parts x_left, x_right = x[:breakpoint], x[breakpoint:] y_left, y_right = y[:breakpoint], y[breakpoint:] # Fit linear regression models for each part model_left = sm.OLS(y_left, x_left) model_right = sm.OLS(y_right, x_right) # Compute the Chow test statistic and p-value chow_num = (results_full.ssr - (model_left.fit().ssr + model_right.fit().ssr)) / 2 chow_denom = (model_left.fit().ssr + model_right.fit().ssr) / (len(y_left) + len(y_right) - 4) chow_stat = chow_num / chow_denom p_value = 1 - stsf.cdf(chow_stat, 2, len(y_left) + len(y_right) - 4) return p_value

定义窗口进行chow test

side_winds = 7sample_size = 50for cfip in np.random.choice(train.cfips.unique(), size=sample_size, replace=False):# cfip = 28101 Candidate_dict = {} threadhold = 0.05 last_candi = 0 df = train.loc[train.cfips==cfip,'microbusiness_density'].values xts = np.arange(len(df)) for breakpoint in xts[side_winds:-side_winds]:# print(f'breakpoint:{breakpoint}') p_value = chow_test(df[breakpoint-side_winds:breakpoint+side_winds], xts[breakpoint-side_winds:breakpoint+side_winds], side_winds)# print(f'p_value:{p_value}') # print(p_value ) if p_value<threadhold: set_candi_condi = (last_candi in Candidate_dict) & (0<breakpoint- last_candi< side_winds)# print(f'last_candi:{last_candi}') # print( set_candi_condi,(last_candi in Candidate_dict),(0<breakpoint- last_candi< side_winds) ) if set_candi_condi: if Candidate_dict[last_candi]>p_value: Candidate_dict[breakpoint] = p_value del Candidate_dict[last_candi] last_candi = breakpoint else: last_candi = breakpoint Candidate_dict[breakpoint] = p_value Candidate = sorted(Candidate_dict.items(), key=lambda x: x[1])[:3] Candidate = sorted([x[0] for x in Candidate if x[1]<0.001]) print('\n\n Candidate:',Candidate) pltObj = MyPlots(x=xts, y=df,barOrLine='line', title='Detecting ouliers as change of trends ', SeriselableName='microbusiness_density', x_label='month', y_label='microbusiness_density', main_title=f'CHOW TEST OF CFIP:{cfip}') min_ = df.min() max_ = df.max() pltObj.vlines(Candidate , ymin=min_ ,ymax= max_, linewidth=5,color='#F4B400', alpha=0.5) pltObj.show()

结果显示，一定程度上的检测出了却构性变动的节点，如42009，趋势由增到减。
同时也创造一些问题，chowtest对付序列中的跳跃非常比较敏感，如18133，将其阶跃而造成的变动，当成一种构造或者趋势的变革。
因此很有必要多种手段进行结合。

基于ARIMA的检测

这里我们将ARIMA，设置为无差分情形下的模型，避免其削弱非常的检测。
只考虑AR(p)与MA(q)的情形。
通过残差比拟，找到异动点，进而剖断非常。

sample_size = 5for cfip in np.random.choice(train.cfips.unique(), size=sample_size, replace=False): df = train.loc[train.cfips==cfip,'microbusiness_density'].values # Fit ARIMA model model = ARIMA(df, order=(1, 0, 1)) model_fit = model.fit() # Calculate residuals residuals = pd.Series(model_fit.resid).abs() forecat = model_fit.forecasts[0] # Calculate mean and standard deviation of residuals mu = residuals[1:].mean() sigma = residuals[1:].std() # Detect temporary change threshold = mu + 3 sigma # Set threshold to 3 sigma temporary_change = residuals[residuals > threshold] pltObj = MyPlots(x=xts, y=df,barOrLine='line', title='Detecting ouliers as change of trends ', SeriselableName='microbusiness_density', x_label='month', y_label='microbusiness_density', main_title=f'ar_p:{ar_p} ma_q:{ma_q} ARIMA TEST OF CFIP:{cfip}', serise1=forecat, color1 ='#DB4437', lableName1 = 'ARIMA Forecast' ) min_ = df.min() max_ = df.max() pltObj.vlines(temporary_change.index , ymin=min_ ,ymax= max_, linewidth=5,color='#F4B400', alpha=0.5) pltObj.show()

在所有工具中，ARIMA结果该当是相比拟较满意的情形，多数情形都能进行很好的创造，当然也不乏一些比如，比如18021这种构造性变革情形，ARIMA是不能创造的。

期待

限于韶光和篇幅，我现在就先讲到这里。
在ARIMA方法中，也容许以通过逐步窗口添加的并预测，然后剖析残差，趋势等以及设计一些仿照函数可以准确的定位到非常的类型。
末了，除了上述方法外，针对非常检测，您还可以考虑利用Ruptures和Prophet等软件包。
当然，为了实现更准确的非常检测，可能须要利用多种方法，例如投票、堆叠或组合。
此外，您可以看到这只是我项目进度的1/5。
在将来的条记本中，我将涵盖以下内容：

离线模型设计 - 如何构建有效的离线模型，特殊是在这个比赛中；构建有效特色 - 如何构建有用的特色，比如若何的周期特色，韶光特色；离线模型设计 - 传统模型和一些测试，具。
体来说是ARIMA，颠簸率波形等一些测试模式识别 - 在数据中识别模式，归类不同的模式分别建模以及利用模式构建特色工程；善用非常结果 - 如何利用检测到的非常来构建特色。
最最末了，创作不易，无业游民，如果以为文章不错！
！
！
点点赞吧！
公众年夜众号！
YB统计大家工智能!!!