"""Methods for statistical analysis with a time-dependent GEV distribution.
The main reference for this code is the book “An Introduction to
Statistical Modeling of Extreme Values” by Stuart Coles (2001).
The heart of this library is the function negative_log_likelihood, which
is intended to be used for a maximum likelihood fit of a (possibly
time-dependent) GEV function to extreme values. For convenience, the
function time_dependent_GEV_parameters is provided; it takes almost the
same arguments and returns the three GEV parameters mu, sigma, and xi
corresponding to the given points in time.
Example usage for a time-dependent GEV fit with a linear trend and an
annual oscillation in both parameters (location and scale):
import numpy as np
from scipy import optimize
import advanced_GEV_analysis as AGA
from advanced_GEV_analysis import negative_log_likelihood
initial_parameters = [
# mu
20, 1, 1, 1, # average, trend, cosine, sine
# sigma
10, 1, 1, 1, # average, trend, cosine, sine
# xi
0.1,
]
result = optimize.minimize(
negative_log_likelihood,
initial_parameters,
args=(
h_array, # the observed extreme values
t_array, # the times of the events
[AGA.Modifiers.LINEAR_TREND, AGA.Modifiers.SEASONAL_OSCILLATION],
[AGA.Modifiers.LINEAR_TREND, AGA.Modifiers.SEASONAL_OSCILLATION],
),
)
if not result.success:
print("Warning:", result.message)
# Get the fitted parameters (array of the same size as initial_parameters)
fit_params = result.x
# Get the standard deviation corresponding to each fitted parameter
fit_errors = np.sqrt(np.diag(result.hess_inv))
For a time-independent GEV fit, it simplifies to:
initial_parameters = [20, 10, 0.1] # mu, sigma, xi
result = optimize.minimize(
negative_log_likelihood, initial_parameters, args=(h_array,)
)
According to Coles (2001), the shape parameter should be greater than
-0.5 for the maximum likelihood fit to work. If the fit fails, one can
try to impose this as a strict lower bound by giving the argument
bounds=[
(None, None),
...,
(None, None),
(-0.5, None),
]
with the appropriate number of “(None, None)”s (number of
initial_parameters minus 1) to the minimize function. Passing bounds
automatically changes the optimisation method (to L-BFGS-B), in which
case it is necessary to use
fit_errors = np.sqrt(np.diag(result.hess_inv.todense()))
to obtain the standard deviations.
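For example, a bounded time-independent fit could look like this (a
minimal sketch, reusing h_array from above):
initial_parameters = [20, 10, 0.1]  # mu, sigma, xi
result = optimize.minimize(
    negative_log_likelihood,
    initial_parameters,
    args=(h_array,),
    bounds=[(None, None), (None, None), (-0.5, None)],
)
fit_errors = np.sqrt(np.diag(result.hess_inv.todense()))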
If the fit fails with the message "Desired error not necessarily
achieved due to precision loss", then inspect the Jacobian (result.jac)
and relax the gradient tolerance accordingly by giving the argument, e.g.,
options={"gtol": 1e-4}
to the minimize function.
Possible extensions and improvements:
-- using two climate indices at the same time
-- make t_array optional if the model depends only on a climate index
-- The performance of a maximum likelihood fit could be improved by
implementing the negative_log_likelihood function in a different
way: Instead of passing a t_array and modifiers for mu and sigma,
pass directly the arrays that are multiplied by the parameters and
added together to form the parameters. Currently, these would be the
linear t-data, the t-data inside a cosine and a sine, and the climate
indices. To implement this conveniently, write a new function that
takes as arguments t_array, climate_index_array, mu_modifiers,
sigma_modifiers; and returns the arrays that you have to give as
arrays to the new negative_log_likelihood. This would also make the
implementation of the climate index smoother. In a first approach,
one can limit this to the case of equal modifiers for mu and sigma,
in which case it is only necessary to pass the arrays once. In the
more general case of arbitrary time-dependencies of mu and sigma,
this would have to be made a bit more complicated, or the arrays
would have to be passed separately for the two parameters. However,
with this approach it would no longer be possible to include a seasonal
oscillation with a phase.
Written by Markus Reinert, August 2020, February 2021.
Extended by Marvin Lorenz and Markus Reinert, March 2022.
"""
import calendar
from enum import Enum
import numpy as np
from scipy import stats
# Minimum value of xi (in magnitude) for which GEV is used instead of Gumbel
XI_THRESHOLD = 1e-10
# Angular frequency of a seasonal (annual) oscillation in the unit rad/second
SEASONAL_FREQUENCY = 2 * np.pi / (3600 * 24 * 365.2425)
# Scale factor for the time, so that the trend parameter is given per century (e.g. in cm/century) instead of per second
TREND_SCALE = 3600 * 24 * 365.2425 * 100
class Modifiers(Enum):
"""Modifiers that can be used to make the GEV model time-dependent."""
LINEAR_TREND = "linear trend"  # 1 additional parameter
QUADRATIC_TREND = "quadratic trend"  # 1 additional parameter
SEASONAL_OSCILLATION = "seasonal oscillation"  # also called "annual cycle"; 2 additional parameters
SEASONAL_WITH_PHASE = "seasonal with phase"  # 2 additional parameters (amplitude and phase)
SEMIANNUAL_CYCLE = "semiannual cycle"  # 2 additional parameters
CLIMATE_INDEX = "climate index"  # 1 additional parameter
def negative_log_likelihood(
params, z_array, t_array=None, mu_modifiers=[], sigma_modifiers=[], fixed_xi=None,
climate_index_array=None, verbose=False,
) -> float:
"""Calculate the negative log-likelihood of ‘z_array’ being GEV distributed.
The parameters of the GEV distribution are given in ‘params’ in the
order mu, sigma, xi, corresponding to location, scale, and shape
parameter, respectively. Alternatively, the shape can be specified
with the parameter ‘fixed_xi’, which is especially useful in the
context of optimisation, since the shape parameter can be difficult
to estimate. Using fixed_xi=0 limits the GEV family of
distributions to the special case of a Gumbel distribution.
The location and scale parameters of the GEV distribution can
depend on time and/or a climate index. For this purpose, provide
a ‘t_array’ corresponding to the ‘z_array’, as well as modifiers for
mu and/or sigma (the t_array is necessary, even if there is only a
dependence on the climate index). In this case, the list of
parameters must be extended by the necessary coefficients (one for a
linear trend or a climate index, two for a seasonal oscillation) in
the order corresponding to that of the modifiers.
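For example (a minimal sketch with illustrative numbers), with a
linear trend in mu and no modifiers for sigma, the parameters are
expected in the order mu average, mu trend, sigma, xi:
    nll = negative_log_likelihood(
        [20, 1, 10, 0.1],
        z_array,
        t_array,
        mu_modifiers=[Modifiers.LINEAR_TREND],
    )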
If an automatic optimisation of this function does not work, use
verbose=True to see the params with which this function is called.
If the GEV parameters and the z_array form an impossible
combination, then the likelihood is zero, so the log-likelihood is
-infinity and the negative log-likelihood is +infinity. In this
case, np.infty is returned.
Since this function may return np.infty, scipy's optimize may raise
a “RuntimeWarning: invalid value encountered in subtract”. This
warning can be suppressed by wrapping the call to optimize in “with
np.errstate(invalid="ignore"):”. This warning does not appear when
np.infty is replaced by a very large number, e.g. 1e200.
"""
if verbose:
print(params)
# Check whether it is a normal or a time-dependent GEV and get its parameters
if t_array is None:
if fixed_xi is None:
mu, sigma, xi = params
else:
mu, sigma = params
xi = fixed_xi
if sigma < 0:
return np.infty
else:
mu, sigma, xi = time_dependent_GEV_parameters(
params, t_array, mu_modifiers, sigma_modifiers, fixed_xi, climate_index_array
)
if sigma_modifiers and any(sigma < 0):
return np.infty
elif not sigma_modifiers and sigma < 0:
return np.infty
# Calculate and return the negative log-likelihood
if abs(xi) < XI_THRESHOLD:
# Gumbel
return np.sum(
np.log(sigma) + (z_array - mu) / sigma + np.exp(-((z_array - mu) / sigma))
) # Synthesis of Equations 3.9 and 6.5 of Coles (2001)
else:
# GEV
# Check if any point falls out of the domain of the distribution
if all(1 + xi * (z_array - mu) / sigma > 0): # Equation 3.8 of Coles, 2001
return np.sum(
np.log(sigma)
+ (1 + 1/xi) * np.log(1 + xi * (z_array - mu) / sigma)
+ (1 + xi * (z_array - mu) / sigma) ** (-1/xi)
) # Equation 6.5 of Coles (2001)
else:
return np.infty
def time_dependent_GEV_parameters(
params, t_array, mu_modifiers=[], sigma_modifiers=[], fixed_xi=None,
climate_index_array=None,
) -> "3-tuple":
"""Calculate the three GEV parameters for the given time dependence.
See negative_log_likelihood for more information on the arguments.
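For example (a minimal sketch with illustrative numbers), the call
    time_dependent_GEV_parameters(
        [20, 1, 1, 10, 0.1], t_array, [Modifiers.SEASONAL_OSCILLATION]
    )
returns mu = 20 + 1*cos(w*t) + 1*sin(w*t), sigma = 10, and xi = 0.1,
where w is SEASONAL_FREQUENCY and t runs over t_array.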
"""
i = 0
# Parse mu
mu = params[i]
i += 1
for modifier in mu_modifiers:
if modifier == Modifiers.LINEAR_TREND:
mu += params[i] * t_array / TREND_SCALE
i += 1
elif modifier == Modifiers.QUADRATIC_TREND:
mu += params[i] * (t_array / TREND_SCALE) ** 2
i += 1
elif modifier == Modifiers.SEASONAL_OSCILLATION:
mu += params[i] * np.cos(t_array * SEASONAL_FREQUENCY)
i += 1
mu += params[i] * np.sin(t_array * SEASONAL_FREQUENCY)
i += 1
elif modifier == Modifiers.SEASONAL_WITH_PHASE:
mu += params[i] * np.cos(t_array * SEASONAL_FREQUENCY + params[i+1])
i += 2
elif modifier == Modifiers.SEMIANNUAL_CYCLE:
mu += params[i] * np.cos(t_array * 2 * SEASONAL_FREQUENCY)
i += 1
mu += params[i] * np.sin(t_array * 2 * SEASONAL_FREQUENCY)
i += 1
elif modifier == Modifiers.CLIMATE_INDEX:
mu += params[i] * climate_index_array
i += 1
else:
raise NotImplementedError("unknown modifier {!r} for mu".format(modifier))
# Parse sigma
sigma = params[i]
i += 1
for modifier in sigma_modifiers:
if modifier == Modifiers.LINEAR_TREND:
sigma += params[i] * t_array / TREND_SCALE
i += 1
elif modifier == Modifiers.QUADRATIC_TREND:
sigma += params[i] * (t_array / TREND_SCALE) ** 2
i += 1
elif modifier == Modifiers.SEASONAL_OSCILLATION:
sigma += params[i] * np.cos(t_array * SEASONAL_FREQUENCY)
i += 1
sigma += params[i] * np.sin(t_array * SEASONAL_FREQUENCY)
i += 1
elif modifier == Modifiers.SEASONAL_WITH_PHASE:
sigma += params[i] * np.cos(t_array * SEASONAL_FREQUENCY + params[i+1])
i += 2
elif modifier == Modifiers.SEMIANNUAL_CYCLE:
sigma += params[i] * np.cos(t_array * 2 * SEASONAL_FREQUENCY)
i += 1
sigma += params[i] * np.sin(t_array * 2 * SEASONAL_FREQUENCY)
i += 1
elif modifier == Modifiers.CLIMATE_INDEX:
sigma += params[i] * climate_index_array
i += 1
else:
raise NotImplementedError("unknown modifier {!r} for sigma".format(modifier))
# Parse xi
if fixed_xi is None:
xi = params[i]
i += 1
else:
xi = fixed_xi
# Check that all parameters have been parsed
if len(params) > i:
raise ValueError("expected {} parameters but {} were given".format(i, len(params)))
return mu, sigma, xi
def GEV(x, mu, sigma, xi):
"""Cumulative distribution function (CDF) of a GEV."""
if abs(xi) < XI_THRESHOLD:
# Use a Gumbel distribution
return np.exp(-np.exp(-((x - mu) / sigma)))
elif xi > 0:
# For xi > 0, the support of GEV has the lower bound mu-sigma/xi
x_min = mu - sigma / xi
y = np.zeros_like(x, dtype=float)
y[x <= x_min] = 0
y[x > x_min] = np.exp(-((1 + xi * ((x[x > x_min] - mu) / sigma)) ** (-1 / xi)))
return y
else:
# For xi < 0, the support of GEV has the upper bound mu-sigma/xi
x_max = mu - sigma / xi
y = np.zeros_like(x, dtype=float)
y[x < x_max] = np.exp(-((1 + xi * ((x[x < x_max] - mu) / sigma)) ** (-1 / xi)))
y[x >= x_max] = 1
return y
def GEV_return_level(t, mu, sigma, xi, values_per_year=1.0):
"""Compute return level z for return period t of a GEV distribution.
A random variable with the distribution GEV(mu, sigma, xi) will
typically reach a value as high as z once every t years, where z is
the value returned by this function.
That means, for any t >= 1, the following call returns t:
return_period(GEV(GEV_return_level(t, mu, sigma, xi), mu, sigma, xi))
where return_period(p) = 1 / (1 - p).
If there is more than one value per year, i.e., not annual maxima
but for example monthly maxima are used, then the optional argument
values_per_year specifies how many values there are on average per
year (12 in the case of monthly maxima, or less if individual months
are missing). In this case, the return_period (see above) is
defined as return_period(p) = 1 / (1 - p**values_per_year).
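For example (a minimal sketch with illustrative numbers), for a Gumbel
distribution (xi = 0) with mu = 20 and sigma = 10, the 100-year return
level is
    GEV_return_level(100, 20, 10, 0) = 20 - 10 * log(-log(1 - 1/100)),
which is approximately 66.0.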
Reference: Equations (3.4) and (3.10) of Coles (2001)
"""
p = 1 / t
if values_per_year != 1.0:
# If there is more than one value per year, we need to correct p
p = 1 - (1 - p)**(1/values_per_year)
if abs(xi) < XI_THRESHOLD:
# Gumbel distribution
return mu - sigma * np.log(-np.log(1 - p))
else:
# non-Gumbel GEV distribution
return mu - sigma / xi * (1 - (-np.log(1 - p)) ** (-xi))
def GEV_standard_error(t, mu, sigma, xi, V, values_per_year=1.0):
"""Compute the standard error of the return level for return period t.
When the distribution of a random variable is estimated to be
GEV(mu, sigma, xi) with the variance-covariance matrix V, then the
uncertainty associated with the return level z that belongs to the
return period t (see function GEV_return_level) is the value
returned by this function. The upper (resp. lower) bound of the 95%
confidence interval can be approximated by adding (subtracting) 1.96
times the value returned by this function to z.
The main computation in this function is grad_z_p * V * grad_z_p,
where * is a matrix multiplication if grad_z_p is 1-dimensional.
This is basically a scalar product between grad_z_p and V * grad_z_p
and can be written as <grad_z_p|V|grad_z_p> in the bra-ket notation
occasionally used in physics. The square root of this product is
returned by this function, in order to have the standard error and
not the variance.
This function can be used with t as a vector instead of a single
value, in which case a vector of the same size as t is returned.
Note that in this case, grad_z_p is a matrix instead of a vector, so
grad_z_p * V * grad_z_p differs from regular matrix multiplication.
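For example (a minimal sketch, assuming a time-independent fit as in
the module docstring, so that result.x contains mu, sigma, xi and
result.hess_inv approximates their variance-covariance matrix):
    z = GEV_return_level(100, *result.x)
    dz = GEV_standard_error(100, *result.x, V=result.hess_inv)
    lower, upper = z - 1.96 * dz, z + 1.96 * dz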
Reference: Equation (3.11) of Coles (2001)
"""
p = 1 / t
if values_per_year != 1.0:
# If there is more than one value per year, we need to correct p
p = 1 - (1 - p)**(1/values_per_year)
y_p = -np.log(1 - p)
grad_z_p = np.array([
np.ones_like(y_p),
-xi**(-1) * (1 - y_p**(-xi)),
sigma * xi**(-2) * (1 - y_p**(-xi)) - sigma * xi**(-1) * y_p**(-xi) * np.log(y_p),
])
return np.sqrt(np.sum(grad_z_p * np.dot(V, grad_z_p), axis=0))
def format_GEV_parameters(parameters, errors, join_str=", "):
"""Create a string that contains the GEV parameters in a nice format."""
names = ["\\mu", "\\sigma", "\\xi"]
# Write the error with one significant digit
digits = [int(-np.floor(np.log10(e))) if e < 1 else 0 for e in errors]
return join_str.join(
"${name} = {param:.{digits}f} \\pm {std:.{digits}f}$".format(
name=n, param=p, std=e, digits=d
) for n, p, e, d in zip(names, parameters, errors, digits)
)
def get_year_selection(year, time_array):
"""Get the Boolean array that selects ‘year’ from ‘time_array’.
The time in ‘time_array’ must be in seconds since the Epoch.
This function is useful for the selection of annual maxima from a
time series.
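For example (a minimal sketch, where ‘level_array’ is a hypothetical
array of water levels corresponding to ‘time_array’):
    maximum_2000 = level_array[get_year_selection(2000, time_array)].max()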
"""
t_start = calendar.timegm((year, 1, 1, 0, 0, 0, -1, -1, -1))
t_end = calendar.timegm((year+1, 1, 1, 0, 0, 0, -1, -1, -1))
return (time_array >= t_start) & (time_array < t_end)
def get_month_selection(year, month, time_array):
"""Get the Boolean array that selects ‘year’-‘month’ from ‘time_array’.
The month must be between 1 and 12.
The time in ‘time_array’ must be in seconds since the Epoch.
This function is useful for the selection of monthly maxima from a
time series.
"""
t_start = calendar.timegm((year, month, 1, 0, 0, 0, -1, -1, -1))
if month == 12:
t_end = calendar.timegm((year+1, 1, 1, 0, 0, 0, -1, -1, -1))
else:
t_end = calendar.timegm((year, month+1, 1, 0, 0, 0, -1, -1, -1))
return (time_array >= t_start) & (time_array < t_end)
def check_significance(D, k, alpha=0.05):
"""Perform a test for significance based on the deviance statistic.
The test based on the deviance statistic as described by Coles
(2001) compares the log-likelihoods of two statistical models, of
which one generalises the other. That is, the more complex model
must have the same parameters as the simpler model, plus one or
several more. Thus the more complex model always has a likelihood at
least as high as that of the simpler model. The increase in
log-likelihood is compared to a chi-squared distribution to check
whether the more complex model significantly improves on the simpler model.
The result of the statistical test is printed as a text and returned
as either True (increase in likelihood is significant) or False
(more complex model is not significantly better).
Input parameters (all non-negative):
D: 2-times the difference of log-likelihood,
k: difference in the number of parameters,
alpha: significance level.
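For example (a minimal sketch): if nll_simple and nll_complex are the
minimised negative log-likelihoods of the two fits (e.g. result.fun of
the corresponding minimize results) and the more complex model has two
additional parameters, use
    check_significance(2 * (nll_simple - nll_complex), 2)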
"""
print(
"Does the more complex model significantly improve the simpler model (alpha = {})?"
.format(alpha)
)
D_min = stats.chi2.ppf(1 - alpha, df=k)
if D > D_min:
print("Yes! D = {:.2f} > {:.2f}".format(D, D_min))
return True
else:
print("No. D = {:.2f}, but must be more than {:.2f} for significance".format(D, D_min))
return False
def compute_amplitude_and_phase(coeff_cos, coeff_sin, std_cos, std_sin):
"""Compute amplitude and phase of a*cos(t) + b*sin(t).
This function can be used to convert an expression of the form
a*cos(t) + b*sin(t) to amp * cos(t - phase).
The input parameters are coeff_cos (=a) and coeff_sin (=b) and their
respective standard errors.
The return value is (amp, phase, std_amp, std_phase), where:
- amp is the amplitude sqrt(a**2 + b**2),
- phase is such that tan(phase) = b/a,
- std_amp is the standard error of the amplitude,
- std_phase is the standard error of the phase.
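For example (a minimal sketch with illustrative numbers):
    compute_amplitude_and_phase(1, 1, 0.1, 0.1)
gives amp = sqrt(2) and phase = pi/4, corresponding to
cos(t) + sin(t) = sqrt(2) * cos(t - pi/4).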
"""
amp = np.sqrt(coeff_cos ** 2 + coeff_sin ** 2)
phase = -np.arctan2(-coeff_sin, coeff_cos)
std_amp = np.sqrt(
(std_cos * coeff_cos / amp) ** 2 + (std_sin * coeff_sin / amp) ** 2
)
std_phase = np.sqrt(
(std_cos * coeff_sin / amp**2)**2 + (std_sin * coeff_cos / amp**2)**2
)
return amp, phase, std_amp, std_phase