In this article, we will see how to load and use the Boston Housing dataset with scikit-learn (sklearn). The Boston Housing dataset, one of the most widely recognized datasets in machine learning, was derived from the Boston Standard Metropolitan Statistical Area (SMSA) in the 1970s. It is commonly used in regression analysis to predict the median value of homes in the Boston area from a set of predictor variables.
Note: The Boston dataset (load_boston) was deprecated in scikit-learn 1.0 and removed in version 1.2 due to ethical concerns, so it is recommended to use an alternative dataset such as fetch_california_housing.
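As a minimal sketch of the recommended alternative, the snippet below wraps fetch_california_housing in a small helper. The helper name load_california_frame is made up for illustration. The data is downloaded the first time the function is called, so a network connection is assumed.

```python
from sklearn.datasets import fetch_california_housing


def load_california_frame():
    """Return the California housing data as a single pandas DataFrame.

    Hypothetical helper for illustration. The first call downloads the
    data and caches it locally, so it needs network access.
    """
    # as_frame=True asks scikit-learn for pandas objects instead of arrays
    housing = fetch_california_housing(as_frame=True)
    # .frame holds the 8 feature columns plus the target column MedHouseVal
    return housing.frame
```

Calling load_california_frame().head() should then show the first rows, with the median house value in the MedHouseVal column.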
Understanding Boston Dataset
The Boston house-prices dataset is one of the pre-built datasets that shipped with sklearn. The load_boston function loads and returns it for regression tasks. Its key characteristics are:
- Samples total - 506
- Dimensionality - 13
- Features - real, positive
- Targets - real, ranging from 5 to 50 (median home value in $1000s)
Description of Boston Dataset in Sklearn
The Boston Housing dataset contains several columns that are used to describe various aspects of residential homes in Boston. Here is a description of each column in the dataset:
- CRIM: Per capita crime rate by town. It indicates the level of crime in the area.
- ZN: Proportion of residential land zoned for lots over 25,000 sq.ft. This feature reflects the area's residential density.
- INDUS: Proportion of non-retail business acres per town. This indicates how much of the town's land is used for industry and other non-retail business.
- CHAS: Charles River dummy variable (1 if tract bounds river; 0 otherwise). This indicates whether the property is near the Charles River, which may add to the aesthetic value of the neighborhood.
- NOX: Nitric oxides concentration (parts per 10 million). It represents the level of industrial pollutants in the area.
- RM: Average number of rooms per dwelling. More rooms typically indicate more spacious accommodation.
- AGE: Proportion of owner-occupied units built prior to 1940. Older structures might lack newer amenities or could be considered more prestigious depending on the architecture and condition.
- DIS: Weighted distances to five Boston employment centres. This feature measures the accessibility to workplaces, which can influence housing prices.
- RAD: Index of accessibility to radial highways. Higher values indicate easier access to major roadways.
- TAX: Full-value property-tax rate per $10,000. This reflects the annual property tax rate.
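- PTRATIO: Pupil-teacher ratio by town. Lower values typically indicate better educational facilities, which is a significant factor for families when choosing a home.
- B: 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town. This engineered variable is the main source of the ethical concerns that led to the dataset's removal.
- LSTAT: Percentage of the population classified as lower status.
The target, usually called MEDV, is the median value of owner-occupied homes in $1000s.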
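When the data comes from a file rather than from sklearn, it can help to confirm that all expected columns are present. The following is a small sketch, not part of scikit-learn: EXPECTED_FEATURES and missing_features are illustrative names, and the column list assumes the 13 feature names that load_boston reported (including B and LSTAT).

```python
import pandas as pd

# The 13 feature names reported by load_boston (assumed column naming)
EXPECTED_FEATURES = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT",
]


def missing_features(df: pd.DataFrame) -> list:
    """Return the expected feature names that are absent from df, in order."""
    return [name for name in EXPECTED_FEATURES if name not in df.columns]


# Example: a toy frame that only carries two of the columns
toy = pd.DataFrame({"CRIM": [0.00632], "RM": [6.575]})
print(missing_features(toy))
```

An empty list means every expected feature column was found.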
How to Load Boston Dataset in Sklearn
To load the Boston Housing dataset in sklearn, you can use the load_boston function from sklearn.datasets. Note, however, that load_boston() was deprecated in scikit-learn 1.0 and removed in version 1.2 due to ethical concerns regarding the dataset, so the function is only available in older releases. The recommended approach is to use an alternative dataset such as the California housing dataset, or to download the data from a trusted source if you still need the Boston dataset specifically for educational purposes.
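For readers on scikit-learn 1.2 or later who still need the Boston data for teaching, one possible approach is to read the raw file directly. The sketch below assumes the StatLib copy at the URL shown (the source that scikit-learn's own deprecation notice pointed to) is still reachable and still uses its original layout, where each record is split over two lines. The names parse_boston_raw and load_boston_from_source are illustrative helpers, not library functions.

```python
import numpy as np
import pandas as pd

# Assumed source: the StatLib copy referenced by scikit-learn's deprecation notice
DATA_URL = "http://lib.stat.cmu.edu/datasets/boston"

FEATURE_NAMES = [
    "CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
    "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT",
]


def parse_boston_raw(raw_df):
    """Rebuild one row per record from the two-lines-per-record raw layout."""
    values = raw_df.values
    # Even rows hold the first 11 features; odd rows hold B, LSTAT, then MEDV
    data = np.hstack([values[::2, :], values[1::2, :2]])
    df = pd.DataFrame(data, columns=FEATURE_NAMES)
    df["MEDV"] = values[1::2, 2]  # target: median home value in $1000s
    return df


def load_boston_from_source(url=DATA_URL):
    """Download and parse the raw file (needs network access)."""
    # The first 22 lines of the file are a text description, not data
    raw_df = pd.read_csv(url, sep=r"\s+", skiprows=22, header=None)
    return parse_boston_raw(raw_df)
```

Keeping the parsing step separate from the download makes it possible to check the reshaping logic on a small local sample before relying on the remote file.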
Syntax of Boston Dataset in Sklearn
Syntax: sklearn.datasets.load_boston()
The following code loads the dataset into a pandas DataFrame. It runs only on scikit-learn versions earlier than 1.2.
import pandas as pd
from sklearn.datasets import load_boston
# Load the dataset (requires scikit-learn < 1.2)
boston = load_boston()
# Build a DataFrame with one column per feature
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# Display the first five rows
print(df.head())
Output:
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO       B  LSTAT
0  0.00632  18.0   2.31   0.0  0.538  6.575  65.2  4.0900  1.0  296.0     15.3  396.90   4.98
1  0.02731   0.0   7.07   0.0  0.469  6.421  78.9  4.9671  2.0  242.0     17.8  396.90   9.14
2  0.02729   0.0   7.07   0.0  0.469  7.185  61.1  4.9671  2.0  242.0     17.8  392.83   4.03
3  0.03237   0.0   2.18   0.0  0.458  6.998  45.8  6.0622  3.0  222.0     18.7  394.63   2.94
4  0.06905   0.0   2.18   0.0  0.458  7.147  54.2  6.0622  3.0  222.0     18.7  396.90   5.33
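The DataFrame above contains only the features. For regression, the target usually needs to sit alongside them. The sketch below shows one way to do that for any sklearn Bunch that exposes data, feature_names and target. The function name bunch_to_frame and the column name MEDV are assumptions for illustration, and the tiny demo Bunch uses three columns from the first two rows shown above with stand-in target values, so the example runs without load_boston.

```python
import numpy as np
import pandas as pd
from sklearn.utils import Bunch


def bunch_to_frame(bunch, target_name="MEDV"):
    """Combine a Bunch's features and target into one DataFrame."""
    df = pd.DataFrame(bunch.data, columns=bunch.feature_names)
    df[target_name] = bunch.target  # append the target as the last column
    return df


# Stand-in Bunch: three feature columns from the first two rows of the output,
# with illustrative target values
demo = Bunch(
    data=np.array([[0.00632, 6.575, 4.98], [0.02731, 6.421, 9.14]]),
    feature_names=["CRIM", "RM", "LSTAT"],
    target=np.array([24.0, 21.6]),
)
print(bunch_to_frame(demo))
```

On scikit-learn versions that still ship load_boston, passing its return value to bunch_to_frame should give the 13 feature columns followed by the target column.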