Down-sampling with NumPy

Author

Andres Monge

Published

December 13, 2024

Down-sampling involves reducing the size of an array while preserving its essential information.

One common method is to group elements into blocks and compute the mean of each block.

Understanding the Group Size Formula

To find the group size, divide the total number of elements by the desired number of groups and round up to the next integer:

Code

import math

total_elements = 12
shrink_to_elements = 4

group_size = math.ceil(total_elements / shrink_to_elements)

In our example:

Total Elements: 12
Shrink to elements: 4
Therefore, Group Size = 3
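To see why the formula rounds up, here is a minimal sketch that wraps it in a helper. The name group_size is only illustrative and does not appear elsewhere in this post.

```python
import math


def group_size(total_elements: int, shrink_to_elements: int) -> int:
    """Elements per block, rounded up so that every element lands in a block."""
    return math.ceil(total_elements / shrink_to_elements)


print(group_size(12, 4))  # 3: 12 splits evenly into 4 blocks of 3
print(group_size(14, 4))  # 4: 14 / 4 = 3.5, rounded up to 4
```

When the division is not exact, 4 blocks of 4 hold 16 slots for only 14 elements, so the last block is only partly filled.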

Steps

Reshaping: The array is reshaped to group elements into blocks that can be averaged.
Averaging: The mean of each block is calculated to downsample the array.

Creating Groups with Reshape

The reshape function is used to rearrange the elements of the array into a new shape. In the context of down-sampling, we use reshape to group elements into blocks. Let’s break down how this works:

Original Array: We start with a 1D array of 12 elements.
Reshaping: We reshape this array into a 2D array with shape (4, 3). This means we have 4 rows, each containing 3 elements.
Grouping: By doing this, we effectively group the original 12 elements into 4 blocks of 3 elements each.

Example

Code

import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
arr_tuple = np.array([
    (1, 2, 3), (4, 5, 6), (7, 8, 9), (10, 11, 12), (13, 14, 15), (16, 17, 18), 
    (19, 20, 21), (22, 23, 24), (25, 26, 27), (28, 29, 30), (31, 32, 33), (34, 35, 36)
])

"""
Downscaling factor
------------------
>>> downscale_factor = 12 // 4 
>>> downscale_factor = len(arr) // target_size 
>>> downscale_factor = arr.shape[0] // target_size
"""
downscale_factor = 3

"""
Reshape the array to group elements
-----------------------------------
>>> reshaped_arr_tuple = arr_tuple.reshape(-1, downscale_factor, arr_tuple.shape[1]) 
"""
reshaped_arr = arr.reshape(-1, downscale_factor)
reshaped_arr_tuple = arr_tuple.reshape(-1, downscale_factor, 3)

"""
Assert the shape of the reshaped array
"""
assert reshaped_arr.shape == (4, 3)
assert reshaped_arr_tuple.shape == (4, 3, 3)

"""
Downsample the array by taking the mean of each block
"""
downsampled_arr = reshaped_arr.mean(axis=1)
downsampled_arr_tuple = reshaped_arr_tuple.mean(axis=1)

"""
Assert the shape of the downsampled array
"""
assert arr.shape == (12,)
assert downsampled_arr.shape == (4,)
assert arr_tuple.shape == (12, 3)
assert downsampled_arr_tuple.shape == (4, 3)

"""
Results
"""
assert np.array_equal(downsampled_arr, np.array([2.0, 5.0, 8.0, 11.0]))
assert np.array_equal(downsampled_arr_tuple, np.array([
    [4.0, 5.0, 6.0],
    [13.0, 14.0, 15.0],
    [22.0, 23.0, 24.0],
    [31.0, 32.0, 33.0]
]))

Explanation

Axis Parameter (axis=1): The axis parameter in the mean function specifies along which axis of the array the mean should be calculated. In a multidimensional array, axis=0 refers to the first dimension (here, the blocks) and axis=1 refers to the second dimension (here, the elements within each block). Using axis=1 therefore averages over the second dimension, whose size is downscale_factor.
Mean Calculation: When calculating the mean along axis=1, NumPy will compute the mean of each block of downscale_factor elements. Since each block contains tuples of three values, the mean will be calculated individually for each corresponding element across the tuples in a block. This results in a new array where each element is the mean of the corresponding elements from the tuples in the original block.
Reshaping: The -1 in the reshape function tells NumPy to calculate the appropriate size for that dimension.
Averaging: The mean of each block is calculated along axis 1.
Three Parameters: The reshape function here takes three parameters to specify the new shape of the array. These parameters transform the original array into a multidimensional array with the specified dimensions.
- First Parameter (-1): This is an automatic dimension. When you use -1 in the reshape function, NumPy calculates the appropriate size for that dimension based on the original array’s size and the other dimensions you specify. If the original array holds N values in total and you specify downscale_factor and 3 as the other dimensions, the -1 is replaced with N / (downscale_factor * 3).
- Second Parameter (downscale_factor): This specifies the second dimension of the reshaped array. In this context, it groups the elements of the original array into blocks of size downscale_factor.
- Third Parameter (3): This specifies the third dimension of the reshaped array. Since arr_tuple contains tuples of three elements each, this dimension keeps each tuple intact as a separate entity within the reshaped array.
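The behaviour of -1 can be checked directly. The sketch below uses illustrative variable names and also shows what happens when the array length does not fit the requested block size.

```python
import numpy as np

arr = np.arange(1, 13)                      # 12 elements
blocks = arr.reshape(-1, 3)                 # -1 is inferred as 12 / 3 = 4
print(blocks.shape)                         # (4, 3)

rows = np.arange(1, 37).reshape(12, 3)      # 12 rows of 3 values, like arr_tuple
blocks_3d = rows.reshape(-1, 3, 3)          # -1 is inferred as 36 / (3 * 3) = 4
print(blocks_3d.shape)                      # (4, 3, 3)

uneven = np.arange(1, 15)                   # 14 elements
try:
    uneven.reshape(-1, 4)                   # 14 is not a multiple of 4
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)                       # True
```

The ValueError in the last case is the reason an array whose length does not fit has to be extended before it can be reshaped.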

Illustrate the mean calculation

If you have a reshaped array like this:

Code

[
    [(1, 2, 3),    (4, 5, 6),    (7, 8, 9)],
    [(10, 11, 12), (13, 14, 15), (16, 17, 18)],
    [(19, 20, 21), (22, 23, 24), (25, 26, 27)],
    [(28, 29, 30), (31, 32, 33), (34, 35, 36)]
]

and you calculate the mean along axis=1, you get:

Code

[
    ((1+4+7)/3,    (2+5+8)/3,    (3+6+9)/3),
    ((10+13+16)/3, (11+14+17)/3, (12+15+18)/3),
    ((19+22+25)/3, (20+23+26)/3, (21+24+27)/3),
    ((28+31+34)/3, (29+32+35)/3, (30+33+36)/3)
]

[(4.0, 5.0, 6.0), (13.0, 14.0, 15.0), (22.0, 23.0, 24.0), (31.0, 32.0, 33.0)]
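The same numbers can be reproduced with NumPy. This sketch builds the (4, 3, 3) array with np.arange and, for contrast, also averages along axis=2, which would average within each tuple instead of across the tuples in a block.

```python
import numpy as np

# 4 blocks, each holding 3 tuples of 3 values (1..36)
blocks = np.arange(1, 37).reshape(4, 3, 3)

# axis=1 averages across the tuples in a block, position by position
per_block = blocks.mean(axis=1)
print(per_block)
# [[ 4.  5.  6.]
#  [13. 14. 15.]
#  [22. 23. 24.]
#  [31. 32. 33.]]

# axis=2 averages the three values inside each tuple, which is not what we want here
per_tuple = blocks.mean(axis=2)
print(per_tuple[0])  # [2. 5. 8.]
```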

Allow any size of down-sample

Down-sampling by reshaping requires the array length to be exactly target_size * downscale_factor, with no remainder. If the array is shorter than that, we need to add the missing elements to the end of the array.

NumPy has a function, np.pad, that lets us add elements to the end of an array. It can pad with zeros, with a constant value, or with a statistic such as the mean.

In this case, we pad with the mean of the leftover elements at the end of the array. np.pad(mode="mean", stat_length=...) can do this directly; the code below computes the mean explicitly and passes it to np.pad with mode="constant".

Code

uneven_arr = np.array(
    [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0, 14.0]
)

target_size = 4
total_elements = uneven_arr.shape[0]  # 14
downscale_factor = math.ceil(total_elements / target_size)  # 4

# Elements needed to fill the last block, and elements already in that block
missing_elements = downscale_factor * target_size - total_elements  # 2
remaining_elements = total_elements % downscale_factor  # 2

# Pad the array to make its length divisible by the downscale factor
mean_value = np.mean(uneven_arr[-remaining_elements:]) if remaining_elements != 0 else 0
padded_arr = np.pad(
    uneven_arr, 
    pad_width=(0, missing_elements), 
    mode='constant', 
    constant_values=mean_value
)

# Assertions to verify correctness
assert np.array_equal(padded_arr, np.array([
     1.0,  2.0,  3.0,  4.0,
     5.0,  6.0,  7.0,  8.0,
     9.0, 10.0, 11.0, 12.0,
    13.0, 14.0, 13.5, 13.5
]))

# Reshape the array
reshaped_uneven_arr = padded_arr.reshape(-1, downscale_factor)
assert np.array_equal(reshaped_uneven_arr, np.array([
     [1.0,  2.0,  3.0,  4.0],
     [5.0,  6.0,  7.0,  8.0],
     [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 13.5, 13.5]
]))
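As a possible alternative to computing the mean by hand, np.pad's own mode="mean" can produce the same padding. This is a minimal sketch that assumes a float array: np.pad keeps the input dtype, so an integer array could not hold 13.5.

```python
import numpy as np

uneven_arr = np.arange(1.0, 15.0)   # floats 1.0 .. 14.0, 14 elements
missing_elements = 2                # slots to fill at the end
remaining_elements = 2              # trailing values used for the mean

padded = np.pad(
    uneven_arr,
    pad_width=(0, missing_elements),   # pad only at the end
    mode="mean",
    stat_length=remaining_elements,    # mean of the last 2 values: (13 + 14) / 2
)
print(padded[-4:])  # [13.  14.  13.5 13.5]
```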

Additional Explanations

Padding for Uneven Arrays: The code demonstrates how to handle arrays whose length is not evenly divisible by the target size.

Calculating Remaining Elements:

Code

remaining_elements = total_elements % downscale_factor

This calculates how many elements are left over after dividing the array into groups of downscale_factor size.

Mean Value for Padding:

Code

mean_value = np.mean(uneven_arr[-remaining_elements:]) if remaining_elements != 0 else 0

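This calculates the mean of the remaining elements to use as padding. If there are no remaining elements, it defaults to 0. Note that padding can still be required in that case: with 9 elements and a target size of 4, downscale_factor is 3, remaining_elements is 0 and missing_elements is 3, so the last block is filled entirely with zeros.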
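To make the arithmetic concrete, the sketch below evaluates the three quantities for a few array sizes. The helper name padding_plan is hypothetical and only bundles the formulas used above.

```python
import math


def padding_plan(total_elements: int, target_size: int) -> tuple[int, int, int]:
    """Return (downscale_factor, missing_elements, remaining_elements)."""
    downscale_factor = math.ceil(total_elements / target_size)
    missing_elements = downscale_factor * target_size - total_elements
    remaining_elements = total_elements % downscale_factor
    return downscale_factor, missing_elements, remaining_elements


print(padding_plan(12, 4))  # (3, 0, 0): fits exactly, nothing to pad
print(padding_plan(14, 4))  # (4, 2, 2): 2 leftovers, 2 slots to fill
print(padding_plan(10, 4))  # (3, 2, 1): 1 leftover, 2 slots to fill
print(padding_plan(9, 4))   # (3, 3, 0): no leftovers, yet a whole block is missing
```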

Padding the Array:

Code

padded_arr = np.pad(
    uneven_arr, 
    pad_width=(0, missing_elements), 
    mode='constant', 
    constant_values=mean_value
)

This pads the array with the calculated mean value to make its length divisible by the downscale factor.

Reshaping Padded Array:

Code

reshaped_uneven_arr = padded_arr.reshape(-1, downscale_factor)

After padding, the array is reshaped into groups of downscale_factor size.

Flexibility in Down-sampling: This approach allows for down-sampling to any target size, not just sizes that are factors of the original array length.

Preserving Data Characteristics: Because the padding value is the mean of the remaining elements, the mean of the padded final block equals the mean of those remaining elements, so the padding does not pull the last value toward zero or any other arbitrary constant.
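A quick check of this property, using the two leftover values from the example above (the variable names are illustrative):

```python
import numpy as np

leftover = np.array([13.0, 14.0])   # elements that did not fill a whole block
missing_elements = 2                # slots still empty in the last block

# Fill the empty slots with the mean of the leftover elements
last_block = np.concatenate([leftover, np.full(missing_elements, leftover.mean())])

print(last_block)         # [13.  14.  13.5 13.5]
print(last_block.mean())  # 13.5, the same as leftover.mean()
```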

Conclusion

This example demonstrates how to down-sample a NumPy array by averaging groups of elements.

Final Code

Code

import math

import numpy as np


def downsample(arr: np.ndarray, target_size: int) -> np.ndarray:
    """
    Downsample a NumPy array by averaging groups of elements.

    Example
    -------
    >>> arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14])
    >>> target_size = 4
    >>> downsampled_arr = downsample(arr, target_size)
    >>> print(downsampled_arr)
    [ 2.5  6.5 10.5 13.5]

    Parameters
    ----------
    arr : np.ndarray
        The input array to be downsampled.
    target_size : int
        The desired number of elements after downsampling.

    Returns
    -------
    np.ndarray
        The downsampled array.
    """
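    # Work in floating point: np.pad keeps the input dtype, so padding an
    # integer array with a mean such as 13.5 would silently truncate it to 13
    arr = np.asarray(arr, dtype=np.float64)
    total_elements = arr.shape[0]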
    downscale_factor = math.ceil(total_elements / target_size)
    
    # Calculate missing elements
    missing_elements = downscale_factor * target_size - total_elements
    remaining_elements = total_elements % downscale_factor
    
    # Pad the array to make its length divisible by the downscale factor
    if remaining_elements != 0:
        mean_value = np.mean(arr[-remaining_elements:])
    else:
        mean_value = 0
    
    padded_arr = np.pad(
        arr, 
        pad_width=(0, missing_elements), 
        mode='constant', 
        constant_values=mean_value
    )
    
    # Reshape the array
    reshaped_arr = padded_arr.reshape(-1, downscale_factor)
    
    # Downsample the array by taking the mean of each block
    return reshaped_arr.mean(axis=1)

def downsample_tupled(arr: np.ndarray, target_size: int) -> np.ndarray:
    """
    Downsample a NumPy array by averaging groups of elements.

    This other implementation uses `vstack` instead of `pad`.

    Example
    -------
    >>> print(
    ...     downsample_tupled(
    ...         np.array([[2, 2], [2, 4], [2, 6], [7, 8], [9, 10], [11, 12], [13, 14]]), 3
    ...     )
    ... )
    [[ 2.  4.]
     [ 9. 10.]
     [13. 14.]]
    >>> print(downsample_tupled(
    ...     np.array(
    ...         [
    ...             [1, 2, 7],
    ...             [3, 4, 8],
    ...             [5, 6, 9],
    ...             [7, 8, 10],
    ...             [9, 10, 13],
    ...             [11, 12, 14],
    ...             [13, 14, 15],
    ...         ]
    ...     ),
    ...     3,
    ... ))
    [[ 3.          4.          8.        ]
     [ 9.         10.         12.33333333]
     [13.         14.         15.        ]]

    Parameters
    ----------
    arr : np.ndarray
        The input array to be downsampled.
    target_size : int
        The desired number of elements after downsampling.

    Returns
    -------
    np.ndarray
        The downsampled array.
    """
    total_elements = arr.shape[0]
    downscale_factor = math.ceil(total_elements / target_size)

    # Calculate missing elements
    missing_elements = downscale_factor * target_size - total_elements
    remaining_elements = total_elements % downscale_factor

    # Calculate the mean of the last `remaining_elements` rows
    mean_value = (
        np.mean(arr[-remaining_elements:], axis=0)
        if remaining_elements != 0
        else np.zeros(arr.shape[1:], dtype=np.float64)
    )

    # Append the missing rows using `vstack`
    padding_rows = np.tile(mean_value, (missing_elements, 1))
    padded_arr = np.vstack([arr, padding_rows])

    # Reshape the array
    reshaped_arr = padded_arr.reshape(-1, downscale_factor, *arr.shape[1:])

    # Downsample the array by taking the mean of each block
    return reshaped_arr.mean(axis=1)
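The two functions above differ mainly in how they pad. As a rough sketch, they could be folded into one helper that handles both 1D arrays and arrays of tuples. The name downsample_any is hypothetical, and one behaviour is an assumption that differs from the code above: when there are no leftover elements but padding is still needed, this sketch repeats the last row instead of padding with zeros.

```python
import math

import numpy as np


def downsample_any(arr: np.ndarray, target_size: int) -> np.ndarray:
    """Average blocks along the first axis; works for 1D and N-D input."""
    arr = np.asarray(arr, dtype=np.float64)  # avoid integer truncation
    total_elements = arr.shape[0]
    downscale_factor = math.ceil(total_elements / target_size)
    missing_elements = downscale_factor * target_size - total_elements
    remaining_elements = total_elements % downscale_factor

    if missing_elements:
        # Assumption: fall back to the last row when nothing is left over
        tail = arr[-remaining_elements:] if remaining_elements else arr[-1:]
        fill = tail.mean(axis=0, keepdims=True)
        padding = np.repeat(fill, missing_elements, axis=0)
        arr = np.concatenate([arr, padding], axis=0)

    # One block per output element, then average within each block
    blocks = arr.reshape(target_size, downscale_factor, *arr.shape[1:])
    return blocks.mean(axis=1)


print(downsample_any(np.arange(1, 15), 4))  # [ 2.5  6.5 10.5 13.5]
```

Using np.concatenate with keepdims=True keeps the padding logic identical for 1D and 2D input, at the cost of not using np.pad at all.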