The following work is a data analysis project originally submitted as the midterm for my STAT 445 (Computational Statistics) class. The project earned a perfect score.
Fishing is an incredibly important source of food and income for millions of people around the globe. Monitoring fisheries to ensure they are not being depleted too rapidly is an essential consideration to preserve economic and ecological security. One method to track regional ecosystem health is cataloging the age distribution of animal populations. In general, the age of a population in the wild is normally distributed and a significant skew towards young or old members of a population may be a sign that the ecosystem is being overexploited.
Abalone (genus Haliotis) is a popular seafood at risk of overfishing. Abalone populations are carefully monitored and protected to ensure stocks are kept at healthy levels. Currently, abalone age is determined by cutting and staining the shell and counting the rings within, a time-consuming and potentially inaccurate task. An easier and less error-prone method may be to estimate age from other physical measurements, such as the size and weight of the abalone.
The “Abalone” UCI dataset was collected by the Marine Resources Division of the Tasmanian Department of Primary Industry and Fisheries. It measures several physical characteristics of abalone shells and tissue including the number of rings in their shells. This data can be used to determine if correlations between ring count and other physical abalone attributes exist.
This project seeks to answer the following question:
Given that the age of an abalone can be determined by the number of rings in its shell, do other physical attributes of abalone such as weight and size sufficiently correlate with age to be used instead of ring count as a measure of abalone population age distribution?
A summary of the abalone data is provided below:
| Statistic | Length | Diameter | Height | Whole weight | Shucked weight | Viscera weight | Shell weight | Rings |
|---|---|---|---|---|---|---|---|---|
| Min. | 15.0000 | 11.00000 | 0.00000 | 0.4000 | 0.2000 | 0.10000 | 0.30000 | 1.000000 |
| 1st Qu. | 90.0000 | 70.00000 | 23.00000 | 88.3000 | 37.2000 | 18.70000 | 26.00000 | 8.000000 |
| Median | 109.0000 | 85.00000 | 28.00000 | 159.9000 | 67.2000 | 34.20000 | 46.80000 | 9.000000 |
| Mean | 104.7984 | 81.57625 | 27.90328 | 165.7484 | 71.8735 | 36.11872 | 47.76617 | 9.933685 |
| 3rd Qu. | 123.0000 | 96.00000 | 33.00000 | 230.6000 | 100.4000 | 50.60000 | 65.80000 | 11.000000 |
| Max. | 163.0000 | 130.00000 | 226.00000 | 565.1000 | 297.6000 | 152.00000 | 201.00000 | 29.000000 |
The numbers of male, female, and infant samples in the data are roughly equal, though males are the most numerous, which may bias the linear models if male abalone tend to be larger or smaller than female or infant abalone.
The abalone size variables are approximately normally distributed with “Diameter” and “Length” skewed toward smaller sizes. The “Height” data is fairly concentrated with low variation save for two extremely large outlier values. These two points were eliminated from the data. The relationship between each predictor and shell ring count is approximately linear.
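The outlier removal described above can be sketched in R as follows. This is only an illustration: the data frame name `abalone` and the handful of values in it are hypothetical stand-ins, and the original analysis may have identified the two points differently.

```r
# Hypothetical stand-in data: Height in mm, with two implausibly large values.
abalone <- data.frame(
  Height = c(23, 28, 33, 25, 30, 27, 226, 103),
  Rings  = c(8, 9, 11, 8, 10, 9, 8, 10)
)

# Row indices of the two largest Height values.
outlier_rows <- order(abalone$Height, decreasing = TRUE)[1:2]

# Drop those rows while keeping every column.
abalone_clean <- abalone[-outlier_rows, ]

summary(abalone_clean$Height)
```

Dropping whole rows rather than blanking the single value keeps the remaining measurements for each animal consistent with one another.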
Like the size variables, the weight variables are approximately normally distributed with “Shucked weight” and “Whole weight” skewed toward larger weights. The relationship between each predictor and shell ring count is approximately linear but a quadratic equation may be more accurate for the “Whole weight” variable.
The abalone shell ring count is roughly bell-shaped (which is good news for the fishery, as this suggests the current population is relatively healthy). The mean number of rings is 9.93, and the distribution has a positive skew (skewness = 1.11), with a longer tail toward higher ring counts.
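For readers who want to reproduce a skew figure like the one above, here is a small base R sketch of the moment-based skewness coefficient, g1 = m3 / m2^(3/2). It is an assumption that this matches the definition used in the project, since several packages offer slightly different estimators, and the `rings` vector is made up for illustration.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^(3 / 2)
}

# Hypothetical ring counts with a long tail toward older animals.
rings <- c(7, 8, 8, 9, 9, 9, 10, 10, 11, 13, 19)

skewness(rings)  # positive, because of the long right tail
```

A value near zero indicates a symmetric distribution; positive values indicate a tail toward larger ring counts.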
Correlations between variables were calculated to determine which variables may best predict ring count. Perhaps unsurprisingly, most of the weight variables are very closely correlated with each other, as are the size measurements. “Length” and “Diameter” are almost perfectly positively correlated at 0.99. “Whole weight” has a very strong positive correlation with “Viscera weight” and “Shucked weight”, both at 0.97. “Shell weight” has the highest correlation with ring count at 0.63. All the variables were positively correlated with each other, meaning that when one increases the others tend to increase as well. All correlations were significant, with p-values below 0.001.
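A correlation analysis of this kind might look like the sketch below, which uses base R's `cor()` for the full matrix and `cor.test()` for the significance of a single pair. The data is simulated purely so the example runs on its own; the column names mirror the dataset, but the numbers are not the real measurements.

```r
set.seed(1)  # reproducible simulated data
n <- 200
len <- rnorm(n, mean = 105, sd = 24)

# Simulated stand-in for a few of the abalone columns.
abalone <- data.frame(
  Length       = len,
  Diameter     = 0.8 * len + rnorm(n, sd = 2),
  Shell.weight = 0.45 * len + rnorm(n, sd = 8)
)
abalone$Rings <- 3 + 0.14 * abalone$Shell.weight + rnorm(n, sd = 2)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(abalone)
round(cor_matrix, 2)

# Significance test for one pair of variables.
shell_test <- cor.test(abalone$Shell.weight, abalone$Rings)
shell_test$estimate
shell_test$p.value
```

Note that `cor()` only reports the coefficients; p-values have to be obtained pair by pair with `cor.test()` or a helper package.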
A linear model which included all predictors was constructed.
| Coefficient | Estimate | Standard error | 95% confidence interval | t-value | p-value |
|---|---|---|---|---|---|
| (Intercept) | 2.78 | 0.268 | 2.252, 3.3034 | 10.4 | 7.8e-25 |
| SexF | 0.789 | 0.102 | 0.5893, 0.9888 | 7.74 | 1.21e-14 |
| SexM | 0.845 | 0.0953 | 0.6582, 1.0316 | 8.87 | 1.08e-18 |
| Length | -0.0058 | 0.009 | -0.0234, 0.0119 | -0.644 | 0.52 |
| Diameter | 0.0462 | 0.0111 | 0.0244, 0.0681 | 4.15 | 3.34e-05 |
| Height | 0.116 | 0.0114 | 0.094, 0.1386 | 10.2 | 3.27e-24 |
| Whole.weight | 0.0443 | 0.00361 | 0.0372, 0.0513 | 12.3 | 4.53e-34 |
| Shucked.weight | -0.0969 | 0.00407 | -0.1049, -0.089 | -23.8 | 1.28e-117 |
| Viscera.weight | -0.0556 | 0.00644 | -0.0682, -0.043 | -8.64 | 8.18e-18 |
| Shell.weight | 0.0384 | 0.00563 | 0.0274, 0.0495 | 6.82 | 1.01e-11 |
The model summary reported that every predictor in the above regression was significant except “Length” (p-value of 0.52). This may be because “Length” and “Diameter” are so closely correlated (0.99) that one of them is enough to convey the information about shell size. The adjusted R-squared value for this model was 0.5429, meaning approximately 54 percent of the variation observed in the abalone shell ring count can be explained by the model.
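As a rough sketch of how a table like the one above can be produced, the following fits a linear model on all predictors and extracts the coefficient table, 95% confidence intervals, and adjusted R-squared. The data is simulated and contains only a subset of the columns, so the numbers will not match the project's. Setting infants ("I") as the reference level of `Sex` is an assumption inferred from the `SexF` and `SexM` rows in the table.

```r
set.seed(2)  # reproducible simulated data
n <- 300
len <- rnorm(n, 105, 24)

abalone <- data.frame(
  # Infants first, so F and M are reported relative to I.
  Sex          = factor(sample(c("I", "F", "M"), n, replace = TRUE),
                        levels = c("I", "F", "M")),
  Length       = len,
  Diameter     = 0.8 * len + rnorm(n, 0, 2),
  Shell.weight = 0.45 * len + rnorm(n, 0, 8)
)
abalone$Rings <- 3 + 0.8 * (abalone$Sex != "I") +
  0.14 * abalone$Shell.weight + rnorm(n, 0, 2)

# Regress ring count on every other column.
fit <- lm(Rings ~ ., data = abalone)

coef_table <- coef(summary(fit))       # estimate, std. error, t value, p-value
ci         <- confint(fit, level = 0.95)
adj_r2     <- summary(fit)$adj.r.squared

round(cbind(coef_table, ci), 4)
adj_r2
```

With two nearly collinear predictors such as `Length` and `Diameter`, it is common for one of them to show a large p-value even though either alone would be significant.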
Two final linear models were constructed: one which only included predictors that are theoretically noninvasive to collect and another which could use any of the predictors in the data. The models were refined using the step() function, which performs stepwise selection: it adds or drops predictors one at a time and keeps the combination with the lowest Akaike Information Criterion (AIC) it finds. Each model was then tested to determine how well it predicted the ring count distribution of an abalone population. The data was randomly split 70/30 into training data (used to construct the model) and testing data (used to test the model).
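The split-and-select procedure can be sketched as below. The seed, the simulated data, and the `direction = "both"` setting are assumptions made for illustration; only `step()` and the 70/30 split come from the description above.

```r
set.seed(445)  # arbitrary seed so the split is reproducible
n <- 300
len <- rnorm(n, 105, 24)

# Simulated stand-in for a subset of the abalone columns.
abalone <- data.frame(
  Sex          = factor(sample(c("I", "F", "M"), n, replace = TRUE),
                        levels = c("I", "F", "M")),
  Length       = len,
  Diameter     = 0.8 * len + rnorm(n, 0, 2),
  Shell.weight = 0.45 * len + rnorm(n, 0, 8)
)
abalone$Rings <- 3 + 0.8 * (abalone$Sex != "I") +
  0.14 * abalone$Shell.weight + rnorm(n, 0, 2)

# Random 70/30 split into training and testing rows.
train_idx        <- sample(n, size = round(0.7 * n))
training.abalone <- abalone[train_idx, ]
testing.abalone  <- abalone[-train_idx, ]

# Start from the full model and let step() add or drop terms by AIC.
full     <- lm(Rings ~ ., data = training.abalone)
selected <- step(full, direction = "both", trace = 0)

formula(selected)
```

Stepwise selection is a greedy search, so the result is the best model found along the path it takes rather than a guaranteed optimum over all predictor subsets.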
The two final models were:
```r
# Model 1: Ideal Noninvasive Regression
lm(Rings ~ Sex + Height + Diameter + Length + Whole.weight,
   data = training.abalone)

# Model 2: Ideal Non-restricted Regression
lm(Rings ~ Sex + Height + Diameter + Whole.weight + Shucked.weight +
     Viscera.weight + Shell.weight,
   data = training.abalone)
```
The noninvasive model (row 1) and the non-restricted model (row 2) were tested on the held-out testing data, and the resulting predicted values were compared against the true ring counts (row 3).
| Model | R-squared | Minimum | First quartile | Median | Mean | Third quartile | Maximum | Standard deviation | Skew |
|---|---|---|---|---|---|---|---|---|---|
| Model 1: Noninvasive | 0.3813 | 3.594 | 8.593 | 10.376 | 10.006 | 11.521 | 14.896 | 2.028 | -0.488 |
| Model 2: Non-restricted | 0.5423 | -3.268 | 8.345 | 9.879 | 9.990 | 11.425 | 20.050 | 2.471 | 0.358 |
| True population values | NA | 3.000 | 8.000 | 9.000 | 9.940 | 11.000 | 27.000 | 3.230 | 1.107 |
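A comparison table of this shape could be assembled roughly as follows. The `describe()` helper is hypothetical, the data is simulated, and computing R-squared as the squared correlation between predictions and true values on the testing data is only one possible definition; the project may have used another.

```r
set.seed(7)  # reproducible simulated data
n <- 300
shell <- rnorm(n, 48, 13)
abalone <- data.frame(Shell.weight = shell,
                      Rings = 3 + 0.14 * shell + rnorm(n, 0, 2))

train_idx        <- sample(n, size = round(0.7 * n))
training.abalone <- abalone[train_idx, ]
testing.abalone  <- abalone[-train_idx, ]

# Hypothetical helper: distribution summary of a numeric vector.
describe <- function(x) {
  d <- x - mean(x)
  q <- unname(quantile(x, c(0.25, 0.5, 0.75)))
  c(Min = min(x), Q1 = q[1], Median = q[2], Mean = mean(x), Q3 = q[3],
    Max = max(x), SD = sd(x), Skew = mean(d^3) / mean(d^2)^1.5)
}

fit       <- lm(Rings ~ Shell.weight, data = training.abalone)
predicted <- predict(fit, newdata = testing.abalone)

# One row for the model's predictions, one for the true ring counts.
comparison <- rbind(Model = describe(predicted),
                    True  = describe(testing.abalone$Rings))
r2_test <- cor(predicted, testing.abalone$Rings)^2

round(comparison, 3)
r2_test
```

Comparing whole distributions rather than individual predictions matches the stated goal, which is to recover the population's age structure rather than the age of any single animal.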
The non-restricted model distribution (red), noninvasive model distribution (green), and true population distribution (blue) are shown below.
Notable results:
Both models predicted means within +/- 0.1 rings of the true population mean.
Both models predicted first and third quartiles within +/- 0.75 rings of the true quartiles.
The noninvasive model predicted data with a slight negative skew (-0.488) while the non-restricted model had a slight positive skew (0.358). The true population had a positive skew (1.107).
Both models underestimated the standard deviation of the data (mainly not accounting for the skew towards older abalone).
Overall, the non-restricted model followed the true abalone population ring count distribution the closest.
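One likely contributor to the underestimated standard deviation is a general property of least-squares regression rather than anything specific to abalone: on the data a model was fit to, the standard deviation of the fitted values equals the square root of R-squared times the standard deviation of the observed values. The sketch below checks this on simulated data. The reported figures are roughly in line with it (for the non-restricted model, sqrt(0.5423) × 3.230 ≈ 2.38 against the reported 2.471), though the match can only be approximate when predictions are made on data the model was not fit to.

```r
set.seed(3)  # reproducible simulated data
n <- 300
shell <- rnorm(n, 48, 13)
rings <- 3 + 0.14 * shell + rnorm(n, sd = 2)

fit <- lm(rings ~ shell)

r2       <- summary(fit)$r.squared
sd_ratio <- sd(fitted(fit)) / sd(rings)

# For a least-squares fit with an intercept these two agree in-sample.
c(sd_ratio = sd_ratio, sqrt_r2 = sqrt(r2))
all.equal(sd_ratio, sqrt(r2))
```

In other words, any linear model with an R-squared below 1 will predict a narrower spread of ring counts than the population actually has.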
The goal of this project was to determine whether physical characteristics of abalone could be used to estimate the distribution of shell ring count (and, by extension, age) in an abalone population. One of the constructed models only included data which would theoretically be noninvasive to collect, allowing sampled abalone to be returned after being measured. The other model created for this project was not restricted, meaning it included data which would involve killing the abalone. Both models predicted population distributions which were fairly accurate, though the non-restricted model was more accurate in predicting population skew and standard deviation. This implies that the mass of abalone tissue is more strongly associated with its age than the size of its shell. While many variables would have to be collected to implement this model, marine biologists are often collecting such data points already. Since counting abalone shell rings is a laborious and time-consuming process, it may be advantageous to implement a model similar to the one created in this project when conducting a survey of abalone with a large sample size.
“Green lip abalone from Western Australia” by “Thewildnomad” is licensed under CC-BY-SA 4.0.
“Abalone” UCI dataset - https://archive.ics.uci.edu/ml/datasets/Abalone