When a DataFrame has many rows, you can randomly select a few of them instead of working with the whole dataset. For example, suppose a DataFrame has rows [A, B, C, D, E]. If you randomly pick 2 rows, one possible result is [C, E].
Here is the sample DataFrame used in this article:
import pandas as pd
data = {'Employee': ['Emily', 'Emma', 'Jake', 'David', 'Eva'],
'Department': ['HR', 'IT', 'Finance', 'Marketing', 'IT'],
'Age': [28, 34, 25, 42, 30],
'Salary': [50000, 60000, 45000, 70000, 52000]}
df = pd.DataFrame(data)
print(df)
Output
Employee Department Age Salary
0 Emily HR 28 50000
1 Emma IT 34 60000
2 Jake Finance 25 45000
3 David Marketing 42 70000
4 Eva IT 30 52000
Let's explore different methods to randomly select rows from a Pandas DataFrame.
Using sample()
The sample() method allows specifying the number of rows, a fraction of rows, whether to sample with replacement, weights and reproducibility via random_state.
Example: Below, we randomly select one row using sample().
row = df.sample()
print(row)
Output
Employee Department Age Salary
2 Jake Finance 25 45000
Explanation:
- df.sample() selects one random row by default.
- Returns a DataFrame with the sampled row.
- Each execution may return a different row unless random_state is set.
Using n Parameter
The n parameter specifies the exact number of rows to select randomly.
Example: Here, we select three random rows from the DataFrame.
rows = df.sample(n=3)
print(rows)
Output
Employee Department Age Salary
2 Jake Finance 25 45000
3 David Marketing 42 70000
4 Eva IT 30 52000
Explanation:
- n=3 instructs Pandas to return 3 rows.
- Rows are selected randomly without replacement by default.
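Because sampling is without replacement by default, n cannot exceed the number of rows. The sketch below illustrates this on a trimmed-down copy of the sample data; the variable names n_wanted and safe_rows, and the idea of capping n with min(), are illustrative choices rather than anything pandas requires.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Emily', 'Emma', 'Jake', 'David', 'Eva'],
    'Age': [28, 34, 25, 42, 30],
})

# Requesting more rows than exist raises ValueError when replace=False (the default).
try:
    df.sample(n=10)
    error_raised = False
except ValueError as exc:
    error_raised = True
    print("ValueError:", exc)

# One defensive pattern (an assumption, not the only option): cap n at len(df).
n_wanted = 10
safe_rows = df.sample(n=min(n_wanted, len(df)))
print(len(safe_rows))
```

If duplicates are acceptable, passing replace=True is another way to request more rows than the DataFrame holds.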
Using frac Parameter
The frac parameter selects a fraction of rows instead of a fixed number.
Example: In this example, we select 50% of rows randomly from the DataFrame.
sampled_df = df.sample(frac=0.5)
print(sampled_df)
Output
Employee Department Age Salary
2 Jake Finance 25 45000
3 David Marketing 42 70000
Explanation:
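- frac=0.5 selects about half of the DataFrame rows randomly. With 5 rows, half is 2.5, which pandas rounds to a whole number of rows (2 in the output above).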
- Useful when you want a proportional random sample instead of a fixed number.
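A common use of frac is shuffling: frac=1 returns every row in random order. The following is a minimal sketch using a reduced version of the sample data; the seed value and the call to reset_index(drop=True), which discards the original row labels, are optional choices made here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Emily', 'Emma', 'Jake', 'David', 'Eva'],
    'Salary': [50000, 60000, 45000, 70000, 52000],
})

# frac=1 samples 100% of the rows, which amounts to a random reordering.
shuffled = df.sample(frac=1, random_state=0)

# Optionally renumber the rows 0..n-1 after shuffling.
shuffled = shuffled.reset_index(drop=True)
print(shuffled)
```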
Using replace=True
By default, sampling is without replacement. Setting replace=True allows the same row to be selected multiple times.
Example: This code selects 5 rows randomly, allowing duplicates.
sampled_replace = df.sample(n=5, replace=True)
print(sampled_replace)
Output
Employee Department Age Salary
1 Emma IT 34 60000
2 Jake Finance 25 45000
0 Emily HR 28 50000
0 Emily HR 28 50000
0 Emily HR 28 50000
Explanation:
- replace=True allows the same row to appear multiple times.
- Useful for bootstrapping or resampling methods.
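To make the bootstrapping use case concrete, here is a small sketch that resamples the Salary column with replacement many times and records the mean of each resample. The number of resamples (200) and the use of the loop counter as the seed are arbitrary assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Emily', 'Emma', 'Jake', 'David', 'Eva'],
    'Salary': [50000, 60000, 45000, 70000, 52000],
})

# Each resample has the same size as the original (frac=1) and may repeat rows.
boot_means = [
    df['Salary'].sample(frac=1, replace=True, random_state=seed).mean()
    for seed in range(200)
]

# The spread of these means gives a rough idea of the uncertainty of the mean salary.
boot_series = pd.Series(boot_means)
print(boot_series.describe())
```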
Using weights
The weights parameter assigns probabilities to rows so that some rows are more likely to be selected.
Example: This program selects 3 rows with weighted probabilities.
weights = [0.1, 0.2, 0.3, 0.2, 0.2]
weighted_rows = df.sample(n=3, weights=weights)
print(weighted_rows)
Output
Employee Department Age Salary
0 Emily HR 28 50000
2 Jake Finance 25 45000
1 Emma IT 34 60000
Explanation:
- weights is a list with one weight per row. If the weights do not sum to 1, pandas normalizes them.
- Rows with higher weights have a higher chance of being selected.
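Weights do not have to be a hand-written list of probabilities. As a sketch, the example below passes a column name so that rows with higher salaries are more likely to be drawn, and then uses zero weights to exclude some rows entirely. The choice of Salary as the weight column and the specific weight values are assumptions made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Emily', 'Emma', 'Jake', 'David', 'Eva'],
    'Salary': [50000, 60000, 45000, 70000, 52000],
})

# A column name can serve as the weights; values are normalized to sum to 1.
by_salary = df.sample(n=2, weights='Salary', random_state=1)
print(by_salary)

# Rows with weight 0 are never selected, so only rows 2, 3 and 4 can appear here.
zero_weights = [0, 0, 1, 1, 1]
picked = df.sample(n=2, weights=zero_weights, random_state=1)
print(picked)
```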
Using axis Parameter
sample() can also sample columns instead of rows by setting axis=1.
Example: Here, we select 2 random columns from the DataFrame.
col_sample = df.sample(n=2, axis=1)
print(col_sample)
Output
Department Salary
0 HR 50000
1 IT 60000
2 Finance 45000
3 Marketing 70000
4 IT 52000
Explanation:
- axis=1 changes the sampling from rows to columns.
- n=2 selects two columns randomly.
Using random_state for Reproducibility
random_state ensures the same rows are selected every time the code runs.
Example: In this example, we select 2 reproducible random rows.
fixed_rows = df.sample(n=2, random_state=42)
print(fixed_rows)
Output
Employee Department Age Salary
1 Emma IT 34 60000
4 Eva IT 30 52000
Explanation:
- random_state seeds the random number generator.
- Ensures the same random selection on each run.
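A quick way to check this behaviour is to draw two samples with the same seed and compare them. This is a minimal sketch on a reduced version of the sample data; the seed value 42 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Emily', 'Emma', 'Jake', 'David', 'Eva'],
    'Age': [28, 34, 25, 42, 30],
})

# Two calls with the same random_state return identical rows in identical order.
first = df.sample(n=2, random_state=42)
second = df.sample(n=2, random_state=42)
print(first.equals(second))  # True
```

Without random_state, two such calls may or may not return the same rows.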
Using NumPy
NumPy provides an alternative by selecting row indices randomly, then using loc to fetch rows.
Example: Here we select 3 random rows using NumPy.
import numpy as np
indices = np.random.choice(df.index, size=3, replace=False)
np_rows = df.loc[indices]
print(np_rows)
Output
Employee Department Age Salary
4 Eva IT 30 52000
0 Emily HR 28 50000
3 David Marketing 42 70000
Explanation:
- np.random.choice randomly selects row indices.
- replace=False ensures no duplicates.
- df.loc[indices] fetches the corresponding rows.
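NumPy also offers the newer Generator interface, created with np.random.default_rng. The variant below is a sketch that samples row positions rather than index labels and fetches them with iloc, which keeps working even if the index labels are not unique. The seed value and variable names are illustrative assumptions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Employee': ['Emily', 'Emma', 'Jake', 'David', 'Eva'],
    'Age': [28, 34, 25, 42, 30],
})

# Create a seeded generator so the selection is repeatable.
rng = np.random.default_rng(seed=42)

# Pick 3 distinct row positions from 0..len(df)-1.
positions = rng.choice(len(df), size=3, replace=False)

# iloc selects by position, independent of the index labels.
rng_rows = df.iloc[positions]
print(rng_rows)
```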