Background

The following work is a data analysis project originally submitted as the midterm for my STAT 445 (Computational Statistics) class. The project earned a perfect score.

Introduction

Fishing is an important source of food and income for millions of people around the globe. Monitoring fisheries to ensure they are not depleted too rapidly is essential to preserving economic and ecological security. One method of tracking regional ecosystem health is cataloging the age distribution of animal populations. In general, the age of a population in the wild is approximately normally distributed, and a significant skew towards young or old members of a population may be a sign that the ecosystem is being overexploited.

Abalone (genus Haliotis) is a popular seafood at risk of overfishing. Abalone populations are carefully monitored and protected to ensure stocks are kept at healthy levels. Currently, abalone age is determined by cutting and staining the shell and counting the rings within, a time-consuming and potentially inaccurate task. An easier and less error-prone method may be to estimate age from other physical measurements, such as the size and weight of the abalone.

The “Abalone” UCI dataset was collected by the Marine Resources Division of the Tasmanian Department of Primary Industry and Fisheries. It measures several physical characteristics of abalone shells and tissue including the number of rings in their shells. This data can be used to determine if correlations between ring count and other physical abalone attributes exist.

This project seeks to answer the following question:

Given that the age of an abalone can be determined by the number of rings in its shell, do other physical attributes of abalone such as weight and size sufficiently correlate with age to be used instead of ring count as a measure of abalone population age distribution?

Data Exploration

A summary of the abalone data is provided below (size measurements are in millimeters and weights are in grams):

Statistic  Length    Diameter   Height     Whole weight  Shucked weight  Viscera weight  Shell weight  Rings
Min.       15.0000   11.00000   0.00000    0.4000        0.2000          0.10000         0.30000       1.000000
1st Qu.    90.0000   70.00000   23.00000   88.3000       37.2000         18.70000        26.00000      8.000000
Median     109.0000  85.00000   28.00000   159.9000      67.2000         34.20000        46.80000      9.000000
Mean       104.7984  81.57625   27.90328   165.7484      71.8735         36.11872        47.76617      9.933685
3rd Qu.    123.0000  96.00000   33.00000   230.6000      100.4000        50.60000        65.80000      11.000000
Max.       163.0000  130.00000  226.00000  565.1000      297.6000        152.00000       201.00000     29.000000

The number of male, female, and infant samples in the data is roughly equal, though males are the most numerous. This imbalance may affect linear models if male abalone tend to be larger or smaller than female or infant abalone.

The abalone size variables are approximately normally distributed, with “Diameter” and “Length” skewed toward smaller sizes. The “Height” data is fairly concentrated with low variation, save for two extremely large outlier values. These two points were eliminated from the data. The relationship between each size predictor and shell ring count is approximately linear.

Like the size variables, the weight variables are approximately normally distributed with “Shucked weight” and “Whole weight” skewed toward larger weights. The relationship between each predictor and shell ring count is approximately linear but a quadratic equation may be more accurate for the “Whole weight” variable.
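One way to check whether a quadratic term is worthwhile is to fit both forms and compare them with a nested-model F-test. The sketch below does this in R on synthetic data; the curved relationship, the coefficients, and the object names are assumptions made up for illustration and are not the abalone measurements or the code used in the project.

```r
# Synthetic stand-in data: NOT the abalone measurements.
set.seed(445)
n <- 300
Whole.weight <- runif(n, min = 1, max = 500)
# Assumed concave relationship plus noise, chosen only for illustration.
Rings <- 4 + 0.04 * Whole.weight - 0.00005 * Whole.weight^2 + rnorm(n, sd = 1.5)
synthetic <- data.frame(Whole.weight, Rings)

linear_fit <- lm(Rings ~ Whole.weight, data = synthetic)
# I() protects the arithmetic so ^2 means "squared" rather than a formula operator.
quadratic_fit <- lm(Rings ~ Whole.weight + I(Whole.weight^2), data = synthetic)

# Nested-model F-test: a small p-value favors keeping the quadratic term.
comparison <- anova(linear_fit, quadratic_fit)
quadratic_p_value <- comparison[["Pr(>F)"]][2]

print(comparison)
```

Because the linear model is nested inside the quadratic one, the quadratic fit can never have a larger residual sum of squares; the F-test asks whether the improvement is bigger than chance would explain.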

The abalone shell ring count is approximately normally distributed (which is good news for the fishery, as this suggests the current population is relatively healthy). The mean number of rings is between 9 and 10, and the distribution has a positive skew (skew = 1.11), meaning its longer tail extends toward higher ring counts.
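For readers unfamiliar with the skew statistic, here is a minimal sketch of one common definition, the moment-based sample skewness. Base R does not ship a skewness function, so the helper `sample_skewness` is written by hand; it is a hypothetical name, the project may have used a different estimator or a package, and the toy ring counts are made up.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  m2 <- mean(centered^2)
  m3 <- mean(centered^3)
  m3 / m2^(3 / 2)
}

# Toy ring counts (made up for illustration, NOT the abalone data):
# most values cluster low, with a few large ones forming a right tail.
toy_rings <- c(7, 8, 8, 9, 9, 9, 10, 10, 11, 12, 15, 21)

skew_right <- sample_skewness(toy_rings)         # positive: long right tail
skew_left  <- sample_skewness(-toy_rings)        # mirrored data: same size, opposite sign
skew_sym   <- sample_skewness(c(1, 2, 3, 4, 5))  # symmetric data: zero
```

A positive value indicates that the longer tail lies on the side of the larger values, which is the situation described for the ring counts.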

Correlations between variables were calculated to determine which variables may best predict ring count. Perhaps unsurprisingly, most of the weight variables are very closely correlated with one another, as are the size measurements. “Length” and “Diameter” are almost perfectly positively correlated at 0.99. “Whole weight” has a very strong positive correlation with both “Viscera weight” and “Shucked weight”, each at 0.97. “Shell weight” has the highest correlation with ring count at 0.63. All the variables were positively correlated with each other, meaning that when one increases the others tend to increase as well. All correlations were significant, with p-values below 0.001.
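The correlation step can be sketched in R with `cor()` for the full matrix and `cor.test()` for the significance of a single pair. The data frame below is synthetic (generated with made-up coefficients so that the block runs on its own); only the column names are borrowed from the abalone data.

```r
# Synthetic stand-in data: NOT the abalone measurements.
set.seed(445)
n <- 200
Length <- rnorm(n, mean = 105, sd = 24)
Diameter <- 0.78 * Length + rnorm(n, sd = 3)
Shell.weight <- pmax(0.45 * Length + rnorm(n, sd = 12), 1)
Rings <- round(pmax(4 + 0.12 * Shell.weight + rnorm(n, sd = 2), 1))
synthetic <- data.frame(Length, Diameter, Shell.weight, Rings)

# Pairwise Pearson correlations between every pair of columns.
cor_matrix <- cor(synthetic)
print(round(cor_matrix, 2))

# Test whether one specific correlation differs from zero.
shell_test <- cor.test(synthetic$Shell.weight, synthetic$Rings)
shell_estimate <- unname(shell_test$estimate)
shell_p_value <- shell_test$p.value
```

`cor.test()` handles one pair at a time, so checking every pair means looping over the column combinations.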

Linear Regression of Data

A linear model which included all predictors was constructed.

Coefficient     Estimate  Standard.error  Confidence.interval  t-value  p-value
(Intercept)     2.78      0.268           2.252, 3.3034        10.4     7.8e-25
SexF            0.789     0.102           0.5893, 0.9888       7.74     1.21e-14
SexM            0.845     0.0953          0.6582, 1.0316       8.87     1.08e-18
Length          -0.0058   0.009           -0.0234, 0.0119      -0.644   0.52
Diameter        0.0462    0.0111          0.0244, 0.0681       4.15     3.34e-05
Height          0.116     0.0114          0.094, 0.1386        10.2     3.27e-24
Whole.weight    0.0443    0.00361         0.0372, 0.0513       12.3     4.53e-34
Shucked.weight  -0.0969   0.00407         -0.1049, -0.089      -23.8    1.28e-117
Viscera.weight  -0.0556   0.00644         -0.0682, -0.043      -8.64    8.18e-18
Shell.weight    0.0384    0.00563         0.0274, 0.0495       6.82     1.01e-11

The model summary reported that every variable used in the above regression was significant except for “Length” (with a p-value of 0.52). This may be because “Length” and “Diameter” are so closely related (correlation of 0.99) that only one of the two is necessary to convey enough information about shell size. The adjusted R-squared value for this model was 0.5429, meaning approximately 54 percent of the variation observed in the abalone shell ring count can be explained by the model.
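A table like the one above can be assembled from `summary()` and `confint()` on a fitted `lm` object. The sketch below is run on synthetic data with made-up coefficients, not the abalone measurements. It also assumes that infant ("I") is the reference level of Sex, which is what the SexF and SexM rows in the table suggest, and that the intervals are at the 95 percent level.

```r
# Synthetic stand-in data: NOT the abalone measurements.
set.seed(445)
n <- 300
# Put "I" first so that SexF and SexM are reported relative to infants.
Sex <- factor(sample(c("I", "F", "M"), n, replace = TRUE), levels = c("I", "F", "M"))
Diameter <- rnorm(n, mean = 82, sd = 20)
Length <- 1.25 * Diameter + rnorm(n, sd = 3)  # nearly redundant with Diameter
Shell.weight <- pmax(0.6 * Diameter + rnorm(n, sd = 10), 1)
Rings <- 3 + 0.8 * (Sex != "I") + 0.03 * Diameter + 0.08 * Shell.weight +
  rnorm(n, sd = 2)
synthetic <- data.frame(Sex, Length, Diameter, Shell.weight, Rings)

# "Rings ~ ." regresses Rings on every other column of the data frame.
full_model <- lm(Rings ~ ., data = synthetic)

coef_table <- summary(full_model)$coefficients  # estimate, std. error, t, p
conf_int <- confint(full_model, level = 0.95)   # assumed 95% level
adj_r2 <- summary(full_model)$adj.r.squared

print(round(cbind(coef_table, conf_int), 4))
print(adj_r2)
```

With two nearly collinear predictors such as Length and Diameter, the standard errors of both coefficients are inflated, which is one reason a predictor can appear non-significant even though it correlates with the response on its own.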

Two final linear models were constructed: one which only included predictors that are theoretically noninvasive to collect and another which could use any of the predictors in the data. The models were refined using R's step() function, which performs stepwise selection of the predictors in a linear regression model by minimizing the Akaike Information Criterion (AIC). Each model was then tested to determine how well it predicted the population distribution in abalone. The data was randomly split 70/30 into training data (used to construct the model) and testing data (used to test the model).
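The split-then-select workflow described above might look like the following sketch. It runs on synthetic data with made-up coefficients; the seed, the object names, and the `direction = "both"` setting are assumptions, since the original call to `step()` is not shown.

```r
# Synthetic stand-in data: NOT the abalone measurements.
set.seed(445)
n <- 400
Diameter <- rnorm(n, mean = 82, sd = 20)
Length <- 1.25 * Diameter + rnorm(n, sd = 3)  # nearly redundant with Diameter
Whole.weight <- pmax(2 * Diameter + rnorm(n, sd = 40), 1)
Rings <- 3 + 0.05 * Diameter + 0.015 * Whole.weight + rnorm(n, sd = 2)
synthetic <- data.frame(Length, Diameter, Whole.weight, Rings)

# 70/30 split: sample row indices without replacement for training.
train_rows <- sample(seq_len(n), size = round(0.7 * n))
training <- synthetic[train_rows, ]
testing <- synthetic[-train_rows, ]

# Start from the model with every predictor, then let step() drop
# (or re-add) terms whenever doing so lowers the AIC.
full_fit <- lm(Rings ~ ., data = training)
selected_fit <- step(full_fit, direction = "both", trace = 0)
print(formula(selected_fit))

# Root mean squared error on the held-out rows.
test_predictions <- predict(selected_fit, newdata = testing)
test_rmse <- sqrt(mean((testing$Rings - test_predictions)^2))
```

Stepwise selection is a greedy search, so it returns a model with a low AIC among those it visits rather than a guaranteed best subset.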

The two final models were:

# Model 1: Ideal Noninvasive Regression
lm(Rings ~ Sex + Height + Diameter + Length + Whole.weight,
   data = training.abalone)

# Model 2: Ideal Non-restricted Regression
lm(Rings ~ Sex + Height + Diameter + Whole.weight + Shucked.weight +
     Viscera.weight + Shell.weight,
   data = training.abalone)

The noninvasive model (row 1) and the non-restricted model (row 2) were tested with the abalone dataset and the resulting predicted values were compared against the true ring counts (row 3).

Model                    R.squared  Minimum  First.Quartile  Median  Mean    Third.Quartile  Maximum  Standard.Deviation  Skew
Model 1: Noninvasive     0.3813     3.594    8.593           10.376  10.006  11.521          14.896   2.028               -0.488
Model 2: Non-restricted  0.5423     -3.268   8.345           9.879   9.990   11.425          20.050   2.471               0.358
True population values   NA         3.000    8.000           9.000   9.940   11.000          27.000   3.230               1.107
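Both models reproduce the mean well but report a smaller standard deviation than the true values. That is expected of least squares: on the data used for fitting, the fitted values have the same mean as the response, and their variance equals R-squared times the variance of the response. On held-out data the relation is only approximate. The sketch below checks the in-sample identity on synthetic data (made-up coefficients, not the abalone measurements).

```r
# Synthetic stand-in data: NOT the abalone measurements.
set.seed(445)
n <- 300
Whole.weight <- pmax(rnorm(n, mean = 165, sd = 95), 1)
Rings <- 6 + 0.02 * Whole.weight + rnorm(n, sd = 2.5)
synthetic <- data.frame(Whole.weight, Rings)

fit <- lm(Rings ~ Whole.weight, data = synthetic)
predicted <- fitted(fit)

# In-sample, var(fitted) / var(observed) equals R-squared.
r_squared <- summary(fit)$r.squared
variance_ratio <- var(predicted) / var(synthetic$Rings)

# Side-by-side summary of the predicted and observed distributions.
describe <- function(x) c(mean = mean(x), sd = sd(x), min = min(x), max = max(x))
comparison <- rbind(predicted = describe(predicted),
                    observed = describe(synthetic$Rings))
print(round(comparison, 3))
```

Applied to the table above, the square root of each R-squared times the true standard deviation of 3.230 gives roughly 2.0 and 2.4, close to the 2.028 and 2.471 reported for the two models.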

The non-restricted model distribution (red), noninvasive model distribution (green), and true population distribution (blue) are shown below.

Notable results:

Discussion

The goal of this project was to determine whether physical characteristics of abalone could be used to estimate the distribution of shell ring count (and by extension, age) in an abalone population. One of the constructed models only included data which would theoretically be noninvasive to collect, allowing sampled abalone to be returned after being measured. The other model created for this project was not restricted, meaning it included data which would involve killing the abalone. Both models predicted a mean ring count within 0.1 of the true mean (10.006 and 9.990 versus 9.940), but both underestimated the spread of the population. The non-restricted model was more accurate in predicting population standard deviation (2.471 versus a true value of 3.230, compared with 2.028 for the noninvasive model) and skew (0.358 versus a true value of 1.107, compared with -0.488). This suggests that the mass of abalone tissue is more strongly associated with its age than the size of its shell. While many variables would have to be collected to implement this model, marine biologists are often collecting such data points already. Since counting abalone shell rings is a laborious and time-consuming process, it may be advantageous to implement a model similar to the one created in this project when conducting a survey of abalone with a large sample size.

Works Cited

“Green lip abalone from Western Australia” by “Thewildnomad” is licensed under CC-BY-SA 4.0.

“Abalone” UCI dataset - https://archive.ics.uci.edu/ml/datasets/Abalone