DEVELOPING RECOMMENDATIONS FOR UNDERTAKING CPUE STANDARDISATION USING OBSERVER PROGRAM DATA

Abundance indices based on nominal CPUE do not take into account confounding factors, such as fishing strategy and environmental conditions, that can decouple any underlying abundance signal from the catch rate. As such, the assumption that CPUE is proportional to abundance is frequently violated. CPUE standardisation is one of the analyses commonly applied to address this. The aims of this paper were to provide a statistical modelling framework for conducting CPUE standardisations using the Observer Program data for bigeye tuna, yellowfin tuna, albacore and southern bluefin tuna, and to compare the trends in the nominal CPUEs with those in the standardised indices obtained. The CPUE standardisations were conducted on the Observer Program data collected between 2005 and 2007, by applying GLM analysis using the Tweedie distribution. The results suggested that year, area, HBF and bait factors significantly influenced the nominal CPUEs for the four tuna species of interest. Some extreme peaks and troughs in the nominal time series were smoothed in the standardised CPUE time series. The high degree of temporal variability that is still shown in the standardised CPUE trends suggests that the data are too sparse to give any meaningful indication of proxy abundance. Nevertheless, this may also suggest that variables used in the GLMs do not sufficiently account for all of the confounding factors, or abundance may indeed be truly variable.


INTRODUCTION
To manage a fish population effectively, it is essential to understand temporal trends in its abundance (Ortega-Garcia et al., 2003, Chen et al., 2004, Maunder et al., 2006a). For commercial longline vessels, CPUE data are the main source of abundance information (Maunder & Punt, 2004, Maunder et al., 2006b, Ward & Hindmarsh, 2007, Bigelow et al., 1999) as fishery-independent data are impractical to collect (Bishop, 2006, Maunder et al., 2006b). Abundance indices based on nominal CPUE do not take into account confounding factors such as fishing strategy (Bach et al., 2000) (including fishing power and catchability (Ward, 2008)) and environmental conditions, which can decouple any underlying abundance signal from the catch rate (Polacheck, 1991, Hinton & Nakano, 1996, Hampton et al., 1998). As such, the assumption that CPUE is proportional to abundance is frequently violated (Maunder et al., 2006b) and, in turn, the relative abundance indices based on nominal CPUE data can be misleading (Maunder & Punt, 2004) or even problematic (Beverton & Holt, 1957, Hilborn & Walters, 1991, Walters, 2003, Ortega-Garcia et al., 2003). Thus, the effects of those confounding factors need to be statistically filtered out before the CPUE time series can be used as a proxy of relative abundance with any accuracy (Polacheck, 1991).
A large proportion of zero catch observations for target and non-target species commonly occurs in catch and effort data (Maunder & Punt, 2004). This was the case in the Observer Program data (collected from the Indonesian trial Observer Program on longline vessels operating in the Indian Ocean out of Benoa Fishing Port). It is important to include these zeros in the CPUE standardisation when estimating the trends in catch rates and understanding the process behind the trends (Minami et al., 2007). CPUE standardisation using statistical distributions that allow for zero observations has been applied in some cases (e.g. Candy, 2004, Basson & Farley, 2005) by fitting GLMs using Tweedie family distributions. The Tweedie distribution can deal with zero values and can sensibly incorporate zero catch data with non-zero catch data into a single model (Candy, 2004). In addition, the Tweedie distribution can accommodate a larger range of models for count data than the Poisson, Negative Binomial, Zero-inflated Poisson and Zero-inflated Negative Binomial models (Minami et al., 2007). The Tweedie distribution approach was therefore adopted in this paper.
The aims of this paper were to provide a statistical modelling framework for conducting CPUE standardisations using the Observer Program data for bigeye tuna, Thunnus obesus (BET), yellowfin tuna, T. albacares (YFT), albacore, T. alalunga (ALB) and southern bluefin tuna, T. maccoyii (SBT), and to compare the trends in the nominal CPUEs with those in the standardised indices obtained. Since the Observer Program data set is only a short time series, meaningful temporal trends are not anticipated. However, the exercise is still valid as a template for undertaking standardisations once longer-term data become available as the Observer Program evolves.
This paper attempts to develop recommendations for ongoing monitoring and analysis, and to provide a statistical modelling framework for undertaking CPUE standardisations on future data.

METHODS
The standardisation should incorporate all of the extraneous variables influencing CPUE in order to take into account their impacts. The effect of these variables is then eliminated and a standardised value is reconstructed that is, ideally, directly proportional to abundance. The extent to which this can be achieved is limited by the amount of available data. The CPUE standardisations in this paper were conducted by applying GLM analysis using the Tweedie distribution. For each tuna species, standardised CPUEs were then plotted together with the nominal CPUE. This enabled comparison of the nominal and standardised CPUEs.

Data Overview
A total of 793 set-by-set records, spanning August 2005 to December 2007, were obtained from Indonesia's Indian Ocean trial Observer Program on tuna longline vessels based at Benoa Fishing Port. Forty-one records were excluded due to incomplete information on fishing techniques and environmental data. The Observer Program data consist of catch and effort data, information on fishing practices, and environmental data (summarised below).

Catch and effort data
Catch and effort were recorded as the number of fish caught and the number of hooks deployed per set, respectively. The catch for this fishery consists of four tuna species, BET, YFT, ALB and SBT, and other byproduct species. The analyses in this paper are only concerned with the four tuna species. Species-specific catch (number of fish) was used as the response variable and the log of effort (number of hooks) was assigned as an offset in the GLM analyses.

Fishing practices
Factors considered under the category of targeting strategies include the number of hooks between floats (HBF), the bait species/combination used, the area fished, the start time of the set, and gear characteristics. However, different fishing practices were sometimes used by vessels to target the same species (pers. comm. with the observers, 2007). These different practices may result in dissimilar catchabilities that will confound the nominal CPUE trend (Maunder & Punt, 2004). Thus, incorporating these fishing practices into the GLM analysis is imperative. The following information on fishing strategies was recorded and included in GLM analyses:

a. Fishing area
Fishing position was recorded by latitude and longitude for each set (Figure 1). The fishing area was divided into five subarea delineations (Figure 1, Table 1). Subareas were used to aggregate the small amount of data available, which otherwise resulted in numerous empty cells (i.e. with no fishing activity recorded) when the fishing area was classified by 1 x 1 degree or 5 x 5 degree blocks.

d. Vessel identification (Vessel Id)
A Vessel Identification factor embraces all attributes of a vessel, such as size, capacity and electronic equipment, and of its crew, such as their ability to find good fishing grounds and to use the gear efficiently, that determine the success of the vessel's fishing activity (Campbell & Hobday, 2003). It is worthwhile taking into account the effect of each individual vessel on catch rates. The Vessel Identification factor was included in the GLM as a categorical variable (Table 1).

e. Start time of set
The time at which the set commenced was employed to represent fishing time and was taken into account as a categorical variable in the GLM, assigned to 6 levels (Table 1). This assignment of the start time of set was adopted from Campbell & Hobday (2003).

f. Lengths of float line, branch line and main line
In addition to the HBF, the actual fishing depth of the longline is influenced by the lengths of the float line and branch line (Bigelow et al., 2002), and by the length of main line between floats (Suzuki et al., 1977). However, for the Benoa-based longline vessels this latter gear change is impractical to undertake within one trip (pers. comm. with the observers, 2007). To eliminate effects of these gear configurations on the nominal CPUEs, these three variables are included as continuous covariates in the GLM analysis (Table 1).

g. Age of main line
Bjordal & Lokkeborg (1996) stated that, generally, new main lines have considerably higher catch rates than used main lines, although the reason for this has never been investigated properly. Therefore, age of the main line was incorporated as a continuous variable in the GLM analysis (Table 1) to take into account its effect (if any) on the catch rates.

Environmental data
Types of environmental data included in the CPUE standardisation are as follows:

a. Phase of the moon
Moon phase information is available as a daily index of moon fraction for all recorded sets and ranges between 0 and 1 (from new moon to full moon). The moon phase was incorporated in the CPUE standardisation as a continuous variable in the GLM analysis. To account for the effect of cyclic behaviour, the moon phase was incorporated as a new variable called "MOON" in the GLM analysis (Table 1), which is defined by a function (equation 1) in which 2π translates the variable into radians and moon phase ranges between 0 and 1.

b. Sea surface temperature (SST)
Sea surface temperature information was calculated using the Spatial Dynamics Ocean Data Explorer (SDODE) in Matlab and was available for each set. To account for a possible non-linear (quadratic) relationship between CPUE and SST, the SST was assigned as a quadratic variable (expressed in R as poly(SST, 2)) and incorporated as a continuous variable in the GLM analysis (Table 1).

c. Sea Conditions
Sea conditions were incorporated using the Beaufort scale (created by Sir Francis Beaufort in 1805) as a continuous variable in the GLM analysis (Table 1).

Generalised linear model
The explanatory variables described previously are summarised in Table 1. The first seven variables were fitted as categorical (factor) variables while variables 8-14 were fitted as continuous (numerical) variables in the GLM (equation 2).
The catch by each level of each categorical variable was examined to identify any levels that only have zero catches for the species of interest. These levels were excluded as being uninformative prior to the GLM analyses for that species (e.g. if there are only zero catches for the species of interest using a certain bait type, then that bait type is excluded). CPUE was defined as the catch, in numbers of fish, per 100 hooks of effort. Since the CPUE is a ratio of two random variables, modelling the distribution of CPUE can be complicated (Candy, 2004). Therefore, catch data and the log of effort were used as the response variable and an offset in the GLM, respectively, and a log-link function was used (Candy, 2004, Basson & Farley, 2005). The catch data were modelled using the Tweedie distribution or the compound Poisson-Gamma distribution (see Jørgensen (1997) and Candy (2004) for a full explanation of the Tweedie distribution). Subsequently, the catch was modelled using all variables mentioned above (equation 2), referred to as the "full model" hereafter, in which c is a constant (intercept), i corresponds to the ith data record, each nth variable has its own coefficient and e is the error term (normally distributed). Each categorical variable has a separate coefficient value for each level of the variable, with j corresponding to the jth coefficient value for the associated level of the categorical variable.
The Tweedie distribution has a power variance function, with the power parameter (k) (Candy, 2004, Basson & Farley, 2005). Values of k between 1 and 2 are appropriate for data containing zero catch observations (Basson & Farley, 2005). Values of k equal to 0, 1 and 2 correspond to the normal, Poisson and gamma distributions, respectively (Candy, 2004, Basson & Farley, 2005). The first step of the GLM process was to select the value of k (1 < k < 2) using the randomised quantile residual diagnostic. This was done by running the full model (equation 2) for a range of k values between 1 and 2. The value of k chosen was the one that gave the flattest scale-location plot of the quantile residuals and the most normally distributed quantile residuals in the normal QQ plot and the histogram of residuals. To enable the use of the Tweedie distributions within the GLM framework and to produce the quantile residual diagnostic plots, respectively, the "tweedie" and "statmod" packages in R were used.
Once the k-value had been determined, the best model/s were selected using the stepwise AIC (Akaike Information Criterion) in R using the "MASS" package. The best model was the one with the lowest AIC value. Models that are within 5 AIC units of the best model, while yielding qualitatively similar CPUE trends, are also included in a short-list of "best options" (pers. comm. with Mark Bravington, Natalie Kelly and Marinelle Basson). A summary table of GLM results for the best and full model is provided, but it should be emphasised that the model selection was done using the stepwise AIC, not the statistics in the summary table, as stepwise AIC is preferable to ANOVA significance for determining the optimal model/s (pers. comm. with Mark Bravington, Natalie Kelly and Marinelle Basson). For the best model, the diagnostic plots were again checked to confirm that no counterintuitive trends were present. These diagnostic plots are presented in Appendix 1.
Interaction terms between year and area, and between quarter and area, were trialled in the GLM analysis. However, as a result of a lack of data across all possible quarter-area combinations, the coefficients of the interaction terms were infinite, which resulted in null values of the indices. Therefore, these interaction terms were not included in the final GLM.
The abundance indices for each of the four tuna species were estimated by reconstructing a standardised CPUE value using the "predict" function in R ("stats" package) on a revised dataset, in which all explanatory variables other than Year and Quarter were held constant. The constant values chosen for the confounding factors were typically the median value of each of the variables. Nominal CPUEs and standardised indices were normalised relative to their respective grand means in order to yield directly comparable relative values.
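For orientation, a generic log-link GLM with an effort offset that is consistent with the description above can be sketched as follows. This is an illustration under stated assumptions and not the authors' equation 2: the symbols \mu_i, \beta, x and j_n(i) are assumed notation, the split into seven categorical and seven continuous terms follows the description of Table 1, a quadratic SST term would contribute an additional squared term, and the error term e is omitted because the sketch is written for the expected catch.

```latex
\log(\mu_i) = c
  + \sum_{n=1}^{7} \beta_{n,\,j_n(i)}
  + \sum_{n=8}^{14} \beta_n \, x_{n,i}
  + \log(\mathrm{effort}_i),
\qquad \mu_i = \mathrm{E}[\mathrm{catch}_i]
```

Here j_n(i) denotes the level of categorical variable n observed for record i. Because the offset enters with a fixed coefficient of 1, the exponential of the remaining terms can be read as the expected catch per hook.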
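The role of the power parameter can be illustrated numerically. The sketch below is written in Python purely for illustration (the analysis itself was run in R) and uses the standard Tweedie relations: variance equal to a dispersion parameter times the mean raised to the power k and, for 1 < k < 2, a compound Poisson-Gamma form with a positive probability of a zero catch. The function names, the dispersion symbol phi and the example values are assumptions introduced here.

```python
import math


def tweedie_variance(mu, phi, k):
    """Variance implied by the power variance function: phi * mu**k."""
    return phi * mu ** k


def prob_zero_catch(mu, phi, k):
    """Probability of a zero under a compound Poisson-Gamma model (1 < k < 2)."""
    if not 1.0 < k < 2.0:
        raise ValueError("k must lie strictly between 1 and 2")
    # Rate of the underlying Poisson count of 'events' in the compound model.
    poisson_rate = mu ** (2.0 - k) / (phi * (2.0 - k))
    return math.exp(-poisson_rate)


# Hypothetical mean catch of 2 fish per set with unit dispersion.
for k in (1.2, 1.5, 1.8):
    print(k, tweedie_variance(2.0, 1.0, k), prob_zero_catch(2.0, 1.0, k))
```

With the mean and dispersion held fixed, larger values of k give a larger variance and a smaller Poisson rate, and therefore a larger share of zero catches.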
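The short-listing rule can be sketched in a few lines. The Python snippet below is an illustration only: the function name, the model labels and the AIC values are hypothetical, and it applies only the numerical part of the rule (within 5 AIC units of the best model), not the additional check that the candidate models yield qualitatively similar CPUE trends.

```python
def shortlist_by_aic(aic_by_model, max_delta=5.0):
    """Return (name, AIC, delta) for models within max_delta of the lowest AIC."""
    best_aic = min(aic_by_model.values())
    ranked = sorted(aic_by_model.items(), key=lambda item: item[1])
    return [
        (name, aic, aic - best_aic)
        for name, aic in ranked
        if aic - best_aic <= max_delta
    ]


# Hypothetical AIC values for four candidate models.
candidates = {"model_1": 1502.0, "model_2": 1504.0, "model_3": 1506.5, "model_4": 1509.0}
for name, aic, delta in shortlist_by_aic(candidates):
    print(name, aic, delta)
```

The returned list is ordered by increasing AIC, which matches the ordering used in the tables of model options.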
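Whether an interaction term can be estimated at all can be checked beforehand by listing the combinations of levels that have no records. The Python sketch below is an illustration only: the function name, the record layout (one dictionary per set) and the example values are assumptions, not the Observer Program data.

```python
from itertools import product


def empty_cells(records, first_key, second_key):
    """Return level combinations that never occur together in the records."""
    first_levels = sorted({record[first_key] for record in records})
    second_levels = sorted({record[second_key] for record in records})
    observed = {(record[first_key], record[second_key]) for record in records}
    return [
        cell for cell in product(first_levels, second_levels)
        if cell not in observed
    ]


# Hypothetical sets: no fishing was recorded in area "B" during quarter 2.
sets = [
    {"quarter": 1, "area": "A"},
    {"quarter": 1, "area": "B"},
    {"quarter": 2, "area": "A"},
]
print(empty_cells(sets, "quarter", "area"))
```

Any combination returned by such a check corresponds to an interaction coefficient that the data cannot inform.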
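The final re-scaling step is simple arithmetic and can be sketched as follows. The Python snippet is an illustration only: the function names and the example numbers are hypothetical, with CPUE expressed per 100 hooks as defined earlier in the methods.

```python
def cpue_per_100_hooks(catch, hooks):
    """Nominal CPUE: number of fish caught per 100 hooks."""
    return 100 * catch / hooks


def normalise_to_grand_mean(values):
    """Divide each value by the mean of the series so that the result averages 1."""
    grand_mean = sum(values) / len(values)
    if grand_mean == 0:
        raise ValueError("cannot normalise a series whose mean is zero")
    return [value / grand_mean for value in values]


# Hypothetical quarterly values on two different scales.
nominal = [cpue_per_100_hooks(3, 1500), cpue_per_100_hooks(6, 1500), cpue_per_100_hooks(9, 1500)]
standardised = [2.0, 4.0, 6.0]
print(normalise_to_grand_mean(nominal))
print(normalise_to_grand_mean(standardised))
```

After normalisation both series are expressed relative to their own mean, so they can be drawn on a common axis even though their raw scales differ.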

RESULTS
The "best model options" for BET, YFT, ALB and SBT, as determined according to the stepwise AIC criterion, are presented in Tables 2 to 5. The best model for each species, i.e. the one with the smallest AIC, was used to predict the standardised CPUEs (Figure 2). The randomised quantile residual diagnostic for the best model is given in Appendix 1. The results of analyses of variance for the best models (for BET, YFT and SBT) are given in Appendix 2. The results of the best model for each data set are summarised in Table 6. Shaded cells indicate the variables included in the best model for each species.
The nominal CPUE trend for the four tuna species was significantly influenced by different factors associated with fishing practices and/or environmental conditions. For BET, the Area, Bait, Vessel Id and length of main line covariates were highly significant (p-value < 0.1%), followed by the age of the main line and SST (p-value < 5%). Area, Quarter, Vessel Id, length of branch line and length of main line were highly significant (p-value < 0.1%) for the YFT GLM, followed by the start time of set covariate (p-value < 1%). For the ALB GLM, Year, Area, Bait, Vessel Id and length of main line were highly significant (p-value < 0.1%), followed by length of branch line and MOON (p-value < 1%), and then Quarter and SST (p-value < 5%). Area, Quarter and Bait had a strongly significant influence on the nominal CPUE for SBT (p-value < 0.1%), followed by length of float line and length of branch line (p-value < 5%). Area, Quarter, Bait, Vessel Id and length of main line were the variables that most commonly had a significant influence on the nominal CPUEs of the four tuna species in the GLMs. The nominal and standardised CPUEs are plotted in Figure 2, and the predicted values of the standardised CPUEs and the associated standard errors are given in Table 7. The standardised CPUEs were predicted using the best models for each species (Tables 2 to 5). The nominal and standardised values are re-scaled relative to their respective grand means. Since only a short time series was available, it is difficult to infer any strong temporal or seasonal abundance patterns, but this approach forms a template for undertaking standardisations as a longer time series of data becomes available.
The best models for BET, YFT and SBT (Table 2, Table 3 and Table 5) do not include the Year factor. The models with the lowest AIC among those that include the Year effect (termed "alternative models") were plotted for those species (not presented in this paper). Under these alternative models, the standardised CPUEs for 2007 were slightly higher for BET and YFT, and more than double for SBT, in every quarter relative to those in 2006. However, the comparison within this paper used the best model for all species.
Comparing the standardised CPUE trends between species for the GLMs, the standardised time series was relatively stable for BET, YFT and SBT, except for two spikes (in the first quarter of 2006 and 2007) (Figure 2). The ALB standardised time series was variable but consistently higher in quarter 3 of 2006 and 2007. This pattern in the standardised indices had previously been obscured by the fishing practices, most significantly by fishing area, bait, vessel, length of main line and length of branch line (in order of decreasing statistical significance) (Table 6).
When comparing nominal and standardised trends, the general effect of the standardisation was to smooth extreme peaks and troughs in the nominal CPUE time series. The features of the BET nominal time series that were smoothed by standardisation were spikes in quarter 3 of 2005 and quarters 3 and 4 of 2007, and troughs in quarter 3 of 2006 and quarter 1 of 2007 (Figure 2). For the ALB nominal time series, the spikes in quarter 4 of 2006 and quarter 2 of 2007 were smoothed, such that consistent peaks in quarter 3 became evident in the standardised time series. For YFT, the spikes in quarters 2 and 3 of 2006 were smoothed, but a peak occurred in the standardised indices in quarters 1 and 2 of 2007.

DISCUSSION
... (Nishida, 2000) and Japanese, Korean and Taiwanese longline vessels (from 1970-1992) (Nishida, 1995) operating in the Western Indian Ocean. Importantly, by eliminating the effect of fishing strategies and environmental variables from the CPUE signal, some extreme peaks and troughs in the nominal CPUE time series were smoothed in the standardised CPUE time series. The high degree of temporal variability that is still shown in the standardised CPUE trends further suggests that the data are too sparse to give an accurate indication of abundance. However, this may also suggest that variables used in the GLMs do not sufficiently account for all of the confounding factors, or the abundance may indeed be truly variable.

In the current standardisation, several models including interaction terms between year and area, between quarter and area, and between year and quarter were trialled. However, the coefficients of the interaction terms were infinite, which resulted in null indices. This again was due to the scarcity of data when grouped by combinations of two effects (i.e. year and area, quarter and area, and year and quarter). Therefore, these interaction terms were removed from the GLM. As more data become available in future, it will be more feasible to include interaction terms in the GLMs. Such interactions are likely to be significant, given the experience of other researchers working with larger data sets. Maunder & Punt (2004) stated that interactions among factors commonly occur when standardising catch and effort data (e.g. Okamoto & Miyabe, 1998, Matsumoto, 2000, Nishida, 2000, Okamoto & Shono, 2008), meaning that simple interpretations of the main effect cannot be used as the basis to develop an index of abundance.
The short time series currently available for the Observer Program data means it is difficult to infer any strong temporal or seasonal abundance patterns from the CPUE standardisations. In addition, given the spatial and fleet coverage limitations of the data set, it should be emphasised that the resulting indices would not yield meaningful results if used to inform a stock assessment. Thus the standardised indices presented in this paper should not be used as input to any stock assessments. Once more data become available, clear temporal and seasonal patterns might become apparent. However, the aim here was to develop a protocol for undertaking standardisations as a longer time series of data becomes available. In the interim, the current standardisation can give some indication of which factors may significantly influence the nominal time series.

CONCLUSION
The results suggested that year, area, HBF and bait factors significantly influenced the nominal CPUEs for the four tuna species of interest. By eliminating the effect of fishing strategies and environmental variables from the nominal CPUE trend, some extreme peaks and troughs in the nominal time series were smoothed in the standardised CPUE time series. The high degree of temporal variability that is still shown in the standardised CPUE trends suggests that the data are too sparse to give any meaningful indication of proxy abundance. Nevertheless, this may also suggest that variables used in the GLMs do not sufficiently account for all of the confounding factors, or abundance may indeed be truly variable.
Appendix 2. Analysis of variance for BET, YFT, ALB and SBT GLMs using the best models.

Table 1. All variables (factors and covariates) used in GLM analysis.

Table 2. List of model options for BET in order of increasing AIC (such that Model 1 is the statistically optimal model). Model option no. 4 is used as an alternative model (with the inclusion of Year factor).

Table 3. List of model options for YFT in order of increasing AIC (such that Model 1 is the statistically optimal model). Model option no. 5 is used as an alternative model (with the inclusion of Year factor).

Table 4. List of model options for ALB in order of increasing AIC (such that Model 1 is the statistically optimal model).

Table 5. List of model options for SBT in order of increasing AIC (such that Model 1 is the statistically optimal model). Model option no. 5 is used as an alternative model (with the inclusion of Year factor).

Table 7. Predicted values of standardised CPUEs for BET, YFT, ALB and SBT with their associated standard errors. These standardised CPUEs are derived from very scarce data and are therefore not for use in any stock assessments.