CLUSTY

Clusty is a toolbox for the clustering of earthquakes based on waveform similarity observed across a network of seismic stations.

Application example and reference: Petersen, Niemz, Cesca, Mouslopoulou, Bocchini (2021): Clusty, the waveform-based network similarity clustering toolbox: concept and application to image complex faulting offshore Zakynthos (Greece), GJI, Volume 224, Issue 3, Pages 2044–2059, https://doi.org/10.1093/gji/ggaa568

Input data: Seismic waveforms, station meta data, (optional: picks)

This manual is still in preparation; please send us an email if you need further help or have suggestions: gesap@gfz-potsdam.de or pniemz@gfz-potsdam.de

Prerequisites:

Installation:

(system wide, from source)

git clone https://git.pyrocko.org/clusty/clusty.git
cd clusty
sudo python setup.py install

To check whether the installation was successful, run

  clusty

in your terminal. A welcome and help message should appear. :-)

Basic commands to run clusty:

A full run can be started using:

  clusty --config CONFIG_FILE --run --log_level LOGLEVEL

The --log_level argument can be DEBUG, INFO, WARNING or ERROR.

However, we recommend running the tool step by step, using the following options:

  • --cc to compute cross correlations
  • --netsim to compute the network similarity
  • --eps_analysis to gain insight into suitable dbscan parameter settings
  • --cluster to cluster the earthquakes based on the precomputed network similarity
  • --plot_results to obtain result plots
  • --merge_freq_results to merge clustering results obtained in different frequency ranges
  • --export_stacked_traces to export stacked waveforms for each cluster

By default, the exported aligned and stacked traces are preprocessed in the same way as for the clustering (downsampling and bandpass filtering). Use the option preprocess_export_stacked_traces: false in the !clusty.config.clustering_settings to turn off preprocessing, but be aware that the time shifts were obtained after filtering and downsampling.
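A typical step-by-step session using these flags might look like the following sketch (the config file name is just a placeholder):

```shell
# compute cross correlations, then the network similarity
clusty --config clusty_config.yaml --cc --log_level INFO
clusty --config clusty_config.yaml --netsim --log_level INFO

# inspect dbscan parameter settings, then cluster and plot
clusty --config clusty_config.yaml --eps_analysis
clusty --config clusty_config.yaml --cluster
clusty --config clusty_config.yaml --plot_results
```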

Input:

  • catalog file in pyrocko format
  • station file in pyrocko format
  • picks (optional) in pyrocko format or xml
  • waveform data - the station code STATION should be in the waveform filename (see also helper_functions/data_download.py for an FDSN download example in this format, or contact us for help with converting continuous data into this format)

Hints:

  • Currently, clusty can work with S or Love (L) phases on the N and E components and with P or Rayleigh waves on the Z component. Other component names, such as HH1 and HH2, are currently not supported.
  • Clusty ignores location codes, because these are often used in an arbitrary way, e.g. they may change for a station over time. If a station has the code NET.STA.00 for some time period and NET.STA.01 for another, Clusty assumes it is the same station and will cross-correlate events from both time periods.

Example configuration file:

A basic config file is created by running the command clusty --init. Settings need to be adjusted afterwards.

Values given here indicate those values that we used for the study of the aftershock sequence of the Zakynthos Oct. 2018 Mw 6.9 event.

USE OF SI UNITS REQUIRED: SECOND, METER, HERTZ

--- !clusty.config.clusty_config
settings:

# general settings
- !clusty.config.GeneralSettings
  # choose the number of cores for parallelized cc calculation
  n_workers: 1

  # set paths
  # if you get some error when initializing clusty, check your paths first
  # it is often the path...
  work_dir: ./
  catalog_file: path/to/catalog
  waveform_dir: path/to/waveforms
  station_file: path/to/stationfile

  # min. and max. distance between stations and events [m]
  station_subset:  
    maxdist_stations: 200e3
    mindist_stations: 50e3

# settings for the cross correlation calculation
- !clusty.config.cc_comp_settings

  # order, fmin (corner highpass [Hz]), fmax (corner lowpass [Hz])
  bp: [3.0, 0.05, 0.2] 

  # define the time window for application of bp filter and downsampling. 
  # this is NOT the window for cc computation.
  # adjust if you have shorter trace snippets only... 
  filtertmin: 120  # seconds before
  filtertmax: 600  # seconds after
  
  # sample interval in seconds 
  # larger sample intervals result in faster cc calculation
  # depending on the corner frequency of the lowpass: 1/downsample_to > 2*fmax
  downsample_to: 0.1  # [s]

  # TIME WINDOW FOR cc computation
  # provide one time window pair for each phase
  # seconds relative to arrival time
  #  -- sorry, that's an inconsistency in the code
  #  -- but kept for compatibility reasons
  #  -- first time should be NEGATIVE
  # if the time window is not set (commented out), time window is set automatically
  # (see below - Methods (4))
  twd: [[-2,5],[-2,10]]

  # Experimental mode for anticorrelations
  # set to 'only' [look for minima in cc function only] or 'both' [look for absolute maximum]
  # 'only':
  #   - the resulting cc-matrices will contain only negative values
  #   - converted to positive values in netsim
  # 'both':
  #   - the resulting cc-matrices will contain negative and positive values
  #   - converted to all positive values in netsim
  # anticor_mode: only

  # by default, one file per event in directory pick_path, 
  # set path of 'pyr_file' for a single pyrocko pick containing all picks 
  # set path to 'xml' in case of an event xml file with picks
  pick_path: ''  # optional
  pick_option: 'pyr_file' # optional

  # set phase(s) to be used 
  # choose between surface ([R]ayleigh, [L]ove) and body waves ([P],[S])
  # adjust the bandpass filter accordingly
  phase: [R, L]

  # choose components; must match your input metadata
  # use x as wild card: [xxZ, XXN, xxE] or [xHZ, xHN, XHE]
  components: [HHZ, HHE, HHN]  
 
  # choose method for arrival calculation
  # true: compute arrivals with velocity models
  # false: extract arrivals from picks
  compute_arrivals: true  

  # set coordinate_system to 'plane' if using a euclidean local system [default = 'spherical']
  # if a local/plane coordinate system is used the station file must be provided as yaml file
  # with north_shift and east_shift 
  coordinate_system: spherical

  # set velocity model path
  # if vmodel is '' and no picks are provided, clusty will try to use the
  # crust2x2 model for the source area
  vmodel: path/to/velocity-model

  # indicate if cross-correlations are already computed, e.g from previous runs
  use_precalc_ccs: false  

  # indicate whether the signal-to-noise ratio (SNR) should be computed for each trace
  snr_calc: true 

  # SNR threshold a trace must pass to be used
  snr_thresh: 2.0

  # maximum inter event distance [m]
  max_dist: 30e3

  # enable/disable debug modes
  # opens interactive waveform browser to check time window and filter settings, for testing only
  # to abort debug mode use ctrl-c or ctrl-AltGr-\ and rerun with debug modes turned off...
  debug_mode: false 
  debug_mode_S: false
 
# settings for the calculation of the network similarity
- !clusty.config.network_similarity_settings

  # indicate if a station weighting should be applied, e.g. in case of uneven station distribution
  get_station_weights: false  
  
  # set network similarity calculation method
  # methods: trimmed_mean_ccs, median_ccs, mean_ccs, max_cc, product, weighted_sum_c_diff
  # (see Petersen & Niemz et al. 2020)
  method_similarity_computation: trimmed_mean_ccs  
 
  # boolean value to indicate whether network similarity matrix is already computed
  use_precalc_net_sim: false  
  
  # parameter for the trimmed mean method 
  # cut-off ratio for the stations with the lowest cross-correlation coefficients;
  # 0.3 corresponds to a rejection of 30%.
  # ATTENTION: The cut-off index is calculated using the FLOOR function.
  # Be aware that this may lead to an unexpectedly low number of trimmed cc values when
  # only a few stations are used.
  # Example: with 6 stations and trimm_cut 0.3, only 1 value is cut off [floor(1.8) = 1],
  # not 2 as expected when rounding. Choose trimm_cut according to the number of available
  # stations, especially if fewer than 10 stations typically record the events.

  trimm_cut: 0.3

  # set the cross-correlation based thresholds (see Petersen & Niemz et al. 2020)
  # cc_thresh: min. cc to be met at ```min_n_stats``` stations covering an azimuthal
  # range of at least ```az_thresh``` degrees to consider an event for clustering
  apply_cc_station_thresh: true 
  cc_thresh: 0.7  
  min_n_stats: 5 
  az_thresh: 60

  # enable combination of components, if disabled clusty provides
  # results for each component separately
  combine_components: true
  
  # weightings for components, same order as in components list above
  weights_components: [0.4, 0.3, 0.3]  
  
  # plotting with 'plotly' or 'plt'/'matplotlib' or set False to skip fruchtermann/connectivity plot
  frucht_plot_lib: 'plotly'

# setting for the clustering procedure
- !clusty.config.clustering_settings
  # set method (currently only ```dbscan``` is available)
  method: dbscan

  # set dbscan parameters
  # two lists are required: ```eps``` and ```min_pts```
  # range of eps values to get started, use smaller steps in a smaller range after finding a rough best value
  # dbscan min pts value = [5] is usually appropriate
  # see Petersen & Niemz et al, 2020, for details 
  dbscan_eps: [0.08, 0.12, 0.14, 0.16, 0.18, 0.2]  
  dbscan_min_pts: [5]

  # enable/disable network similarity matrix plots
  plot_netsim_matrix: false


  # plot cluster results on a map
  plot_map: true

  # choose map plotting library (gmt or cartopy)
  map_plotlib: cartopy

  # set the logarithmic magnitude scaling (circle size) for the EQs on the map with tuple (a,b)
  # markersize = a**magnitude/b
  # comment out to use automated scaling instead
  mag_scale_log: [1.75, 1.8]

  # provide a list of tuples in ```wf_plot```: [(eps1, minpts1),(eps2,minpts1),...] for plotting
  # stacked waveforms of clusters for the given clustering parameters
  wf_plot: []

  # choose stations to be plotted with ```wf_plot_stats```: [net.stat,net.stat2,...]
  wf_plot_stats: []

  # clustered waveforms in snuffler
  # ---> much nicer than the static wf_plot...
  # use syntax of ```wf_plot```
  wf_cl_snuffle: []

  # choose clusters to be plotted with ```wf_cluster_choice```: e.g. [1,5,20]
  wf_cluster_choice: []

  # if your traces are very short you might have to choose another window length 
  # for plotting (default = 20 should be fine otherwise) 
  t_buffer_wfplot: 20.

  # set to true if stacked traces are needed (slow) 
  export_stacked_traces: false  
  
  # debug option for the stacking function
  debug_stacking: false
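The numeric comments in the config above can be verified with a few lines of Python. This is an illustrative sketch of the stated formulas only (the Nyquist check for downsample_to, the floor behaviour of trimm_cut, and the mag_scale_log marker size), not Clusty code:

```python
import math

# Nyquist-style check for downsampling: 1/downsample_to > 2*fmax
fmax = 0.2           # lowpass corner [Hz] (third entry of bp in the example config)
downsample_to = 0.1  # target sample interval [s]
assert 1.0 / downsample_to > 2.0 * fmax  # 10 Hz > 0.4 Hz -> OK

# trimm_cut uses the FLOOR function: with 6 stations and trimm_cut = 0.3
# only floor(6 * 0.3) = floor(1.8) = 1 value is cut off, not 2
n_stations, trimm_cut = 6, 0.3
n_cut = math.floor(n_stations * trimm_cut)
print(n_cut)  # 1

# marker size on the map: markersize = a**magnitude / b with (a, b) = mag_scale_log
a, b = 1.75, 1.8
magnitude = 4.0
markersize = a**magnitude / b
print(round(markersize, 2))  # 5.21
```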


METHODS

(1) Cross-correlation Stacking Methods

  • max_cc:

    • maximum cc of all stations for one event pair used as network similarity --> only for testing, time window and filter selection etc.
  • median_ccs (Scott and Ater, 1993):

    • can be used as a proxy for network similarity, but problematic for wide magnitude ranges: smaller events are only recorded at the closest stations --> use the median of all stations that pass the SNR threshold
  • weighted_sum_c_diff:

    • sum of cc at all stations, weighted by the difference between the first and second cc maxima
  • product and product_combPS (Stuermer et al. 2011):

    • single phase: combine stations as n-th root of product of cc at all (n) stations
    • P and S: multiplication of P and S products, then the 2n-th root
  • mean and trimmed mean (Maurer & Deichmann, 1995):

    • trimmed: the lowest values are removed before the mean calculation; k percent of stations are removed before the network similarity is computed as the mean of the remaining stations. 'k must be determined by trial and error'. The cut-off index is calculated using the floor function. ATTENTION: this may lead to an unexpectedly low number of trimmed cc values. With 6 stations and k = 0.3 only 1 value is cut off (see the trimm_cut parameter).
  • combine P and S:

    • mean of P and S network similarity after using above methods for computing P and S net sim independently. P and S can be weighted. waveform differences can be more distinct on S phases.
  • stacking in different frequency bands (Souberster 2017)
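Two of these stacking rules can be written down in a few lines. The following is a hedged sketch with hypothetical cc values, not the actual Clusty implementation: the trimmed mean drops the floor(k*n) lowest station cc values before averaging, and the product method takes the n-th root of the product over all n stations.

```python
import math

# hypothetical maximum cc values of one event pair at 6 stations
ccs = [0.9, 0.85, 0.8, 0.75, 0.7, 0.3]

def trimmed_mean(ccs, k=0.3):
    """Drop the floor(k*n) lowest cc values, then average the rest."""
    n_cut = math.floor(len(ccs) * k)   # floor(6*0.3) = 1 value removed
    kept = sorted(ccs)[n_cut:]
    return sum(kept) / len(kept)

def product_netsim(ccs):
    """n-th root of the product of cc values over all n stations."""
    return math.prod(ccs) ** (1.0 / len(ccs))

print(round(trimmed_mean(ccs), 3))    # 0.8  (the 0.3 outlier was dropped)
print(round(product_netsim(ccs), 3))  # 0.677
```

Note how the product method is pulled down further by the single low value than the trimmed mean, which removes it.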

(2) Station weighting methods

  • based on azimuthal station position; only implemented for the stacking method "weighted_sum_c_diff"

(3) DBScan

  • using the implementation in scikit-learn with our precalculated distance matrix: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011. https://scikit-learn.org/stable/modules/clustering.html#dbscan

  • DBSCAN (Ester et al., 1996): density-based clustering in which clusters can have any shape and the number of clusters is not predefined. Two samples belong to the same cluster if their distance is less than eps. Core points are points that have at least min_samples neighbours within the distance eps. A point p is directly reachable from a point q if it is within distance eps of q, and reachable from q if there is a path connecting p and q via directly reachable points. Points that do not lie within distance eps of any other point do not belong to any cluster (outliers/noise). All points within a cluster are density-connected, and any point that is density-reachable from a point of the cluster is part of the cluster.
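Since scikit-learn's DBSCAN accepts a precomputed distance matrix, the clustering step can be mimicked on a toy example. The 5-event distance matrix below is hypothetical (e.g. 1 minus network similarity); this is a sketch, not Clusty's internal call:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical distances (e.g. 1 - network similarity) for 5 events:
# events 0-2 are mutually similar, events 3-4 form a second group
dist = np.array([
    [0.00, 0.05, 0.05, 0.90, 0.90],
    [0.05, 0.00, 0.05, 0.90, 0.90],
    [0.05, 0.05, 0.00, 0.90, 0.90],
    [0.90, 0.90, 0.90, 0.00, 0.05],
    [0.90, 0.90, 0.90, 0.05, 0.00],
])

# metric='precomputed' tells DBSCAN to treat `dist` as pairwise distances
labels = DBSCAN(eps=0.15, min_samples=2, metric="precomputed").fit_predict(dist)
print(labels)  # [0 0 0 1 1]
```

Events further than eps from every core point would receive the label -1 (noise).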

(4) Default time windows

Make sure to use meaningful time windows! The below defaults only apply if twd is not set in the config file:

(a) P and S phases:

  • P: start: P-arrival - 2 s; stop: S-arrival - 2 s
  • S: start: S-arrival - 2 s; stop: S-arrival + 1.5 * (S-P time)

(b) R and L phases:

  • both: start: 10 s before arrival; end: arrival + 3 * 1/(cc_settings.bp[1]) + 10
  • → this should ensure a sufficiently long time window for surface phases based on the lower corner frequency of the BP filter

(c) single phases:

  • P: start: P-arrival -2s; end: P-arrival + 15s

  • S: start: S-arrival; end: S-arrival + 15 s

  • → Be very careful with the time windows for the single phases, the length of a meaningful time window strongly depends on the distance between station and event and the chosen frequency range. Especially the default settings for the single phases may not be meaningful at all.

  • R or L: start: 10 s before arrival; end: arrival + 3 * 1/(cc_settings.bp[1]) + 10

  • → this should ensure a sufficiently long time window for surface phases based on the lower corner frequency of the BP filter
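The default surface-wave rule above is simple arithmetic; a sketch for the example bandpass (lower corner bp[1] = 0.05 Hz) with a hypothetical arrival time:

```python
# default surface-wave (R/L) time window, relative to the phase arrival:
# start = arrival - 10 s, end = arrival + 3 * 1/bp[1] + 10 s
bp_fmin = 0.05   # lower bandpass corner [Hz] (cc_settings.bp[1])
arrival = 100.0  # hypothetical arrival time [s]

start = arrival - 10.0
end = arrival + 3.0 * (1.0 / bp_fmin) + 10.0
print(start, end)  # 90.0 170.0 -> an 80 s window for a 0.05 Hz corner
```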

Please send us a message if you have ideas for other automated approaches to set time windows. The currently implemented rules are based on experience with the datasets we have been working with...

Clusty & HypoDD:

To use the cross-correlation matrices as input to hypoDD (Waldhauser et al., 2000), have a look at the helper_functions directory.

Values used in the Zakynthos study:

same settings for all methods:

  • SNR > 2.0
  • cc threshold of 0.7 met at at least 5 stations covering > 60 deg azimuth
  • HHZ (0.4), HHE (0.3), HHN (0.3)
  • MinPts = 5 (tested 3-8)
  • Eps range tested: 0.03 - 0.30

method: TRIMMED MEAN

  • trim: 0.3 (tested also 0.1 and 0.2)
  • 0.05-0.2 Hz --> eps = 0.13
  • 0.02-0.15 Hz --> eps = 0.15
  • 0.1 - 0.5 Hz --> eps = 0.24
  • 0.2 - 1 Hz --> eps = 0.26

method: Weighted sum, weighting based on difference between first and second cc-function max.

  • MinPts = 5
  • 0.02-0.15 Hz --> eps = 0.17

method: products

  • MinPts = 5
  • 0.02-0.15 Hz --> eps = 0.20

method: mean cc of all stations (no trimming)

  • MinPts = 5
  • 0.02-0.15 Hz --> eps = 0.15

method: median cc of all stations

  • MinPts = 5
  • 0.02-0.15 Hz --> eps = 0.15

method: max. cc of all stations

  • MinPts = 5
  • 0.02-0.15 Hz --> eps = 0.05
  • note that this method only requires a large cc value at a single station. Results are therefore not representative for the entire mechanism...

License & Warranty

GNU General Public License, Version 3, 29 June 2007

Clusty is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Clusty is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.