TransformRows

flowtask.components.TransformRows

TransformRows.

TransformRows applies column transformations to a Pandas DataFrame.

TransformRows

TransformRows(loop=None, job=None, stat=None, **kwargs)

Bases: FlowComponent

Overview

The TransformRows class is a component for transforming, adding, or modifying rows in a Pandas DataFrame based on
specified criteria. It supports single and multiple DataFrame transformations, and various operations on columns.

.. table:: Properties :widths: auto

+--------------------+----------+-------------------------------------------------------------------------------------------------------+
| Name               | Required | Description                                                                                           |
+--------------------+----------+-------------------------------------------------------------------------------------------------------+
| fields             |   No     | A dictionary defining the fields and corresponding transformations to be applied.                     |
+--------------------+----------+-------------------------------------------------------------------------------------------------------+
| filter_conditions  |   No     | A dictionary defining the filter conditions for transformations.                                      |
+--------------------+----------+-------------------------------------------------------------------------------------------------------+
| clean_notnull      |   No     | Boolean flag indicating if non-null values should be cleaned, defaults to True.                       |
+--------------------+----------+-------------------------------------------------------------------------------------------------------+
| replace_columns    |   No     | Boolean flag indicating if columns should be replaced, defaults to False.                             |
+--------------------+----------+-------------------------------------------------------------------------------------------------------+
| multi              |   No     | Boolean flag indicating if multiple DataFrame transformations should be supported, defaults to False. |
+--------------------+----------+-------------------------------------------------------------------------------------------------------+
| function           |   No     | See the list of functions in the functions.py file in this directory                                  |
+--------------------+----------+-------------------------------------------------------------------------------------------------------+
| _applied           |   No     | List to store the applied transformations.                                                            |
+--------------------+----------+-------------------------------------------------------------------------------------------------------+

Return

The methods in this class manage the transformation of DataFrames, including initialization, execution, and result handling.



Example:

```yaml
TransformRows:
  fields:
    display_name:
      value:
      - concat
      - columns:
        - first_name
        - last_name
    legal_address:
      value:
      - concat
      - columns:
        - legal_street_address_1
        - legal_street_address_2
    work_address:
      value:
      - concat
      - columns:
        - work_location_address_1
        - work_location_address_2
    first_name:
      value:
      - capitalize
    last_name:
      value:
      - capitalize
    warp_id:
      value:
      - nullif
      - chars:
        - '*'
    old_warp_id:
      value:
      - nullif
      - chars:
        - '*'
    worker_category_description:
      value:
      - case
      - column: benefits_eligibility_class_code
        condition: PART-TIME
        match: Part Time
        notmatch: Full Time
    file_number:
      value:
      - ereplace
      - columns:
        - position_id
        - payroll_group
        newvalue: ''
    original_hire_date:
      value:
      - convert_to_datetime
    hire_date:
      value:
      - convert_to_datetime
    start_date:
      value:
      - convert_to_datetime
    updated:
      value:
      - convert_to_datetime
    gender_code:
      value:
      - convert_to_string
    payroll_id:
      value:
      - convert_to_string
    reports_to_payroll_id:
      value:
      - convert_to_string
```

start (async)

start(**kwargs)

Obtain the Pandas DataFrame.

functions

Functions.

Tree of TransformRows functions.

add_timestamp_to_time

add_timestamp_to_time(df, field, date, time)

Takes a pandas DataFrame and combines the values from a date column and a time column to create a new timestamp column.

:param df: pandas DataFrame to be modified.
:param field: Name of the new column to store the combined timestamp.
:param date: Name of the column in the df DataFrame containing date values.
:param time: Name of the column in the df DataFrame containing time values.
:return: Modified pandas DataFrame with the combined timestamp stored in a new column.
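
The combination described above can be sketched in plain pandas. This is a minimal illustration, not the library's implementation: the helper name `combine_date_time` and the sample column names are hypothetical, and it assumes both columns render to text that `pd.to_datetime` can parse.

```python
import pandas as pd


def combine_date_time(df: pd.DataFrame, field: str, date: str, time: str) -> pd.DataFrame:
    """Hypothetical helper: build a timestamp column from a date and a time column."""
    # Render both columns as text, join them with a space, and let pandas parse the result.
    combined = df[date].astype(str) + " " + df[time].astype(str)
    # Unparseable combinations become NaT instead of raising.
    df[field] = pd.to_datetime(combined, errors="coerce")
    return df


frame = pd.DataFrame({"visit_date": ["2024-01-15"], "visit_time": ["08:30:00"]})
frame = combine_date_time(frame, "visit_ts", "visit_date", "visit_time")
```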

any_tuple_valid

any_tuple_valid(df, field, columns)

Adds a boolean column (named field) to df that is True when any tuple in columns has all of its columns neither NaN nor empty.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | The DataFrame. | required |
| field | str | The name of the output column. | required |
| columns | list of tuple of str | List of tuples, where each tuple contains column names that must be checked. Example: [("start_lat", "start_long"), ("end_lat", "end_long")] | required |

Returns:

| Type | Description |
|------|-------------|
| DataFrame | The original DataFrame with the new field column. |
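
The "any tuple fully populated" rule can be sketched as follows. The helper `any_group_complete` and the sample data are hypothetical; the sketch assumes "empty" means an empty string, which the source does not spell out.

```python
import pandas as pd


def any_group_complete(df, field, columns):
    """Hypothetical sketch: True when at least one tuple of columns is fully populated."""
    result = pd.Series(False, index=df.index)
    for group in columns:
        subset = df[list(group)]
        # A value counts as valid when it is not NaN/None and not an empty string.
        valid = subset.notna() & subset.ne("")
        # The row passes if every column of this tuple is valid.
        result |= valid.all(axis=1)
    df[field] = result
    return df


trips = pd.DataFrame(
    {
        "start_lat": [10.0, None, None],
        "start_long": [20.0, 5.0, None],
        "end_lat": [None, 1.0, None],
        "end_long": [None, 2.0, ""],
    }
)
trips = any_group_complete(
    trips, "has_location", [("start_lat", "start_long"), ("end_lat", "end_long")]
)
```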

apply_function

apply_function(df, field, fname, column=None, **kwargs)

Apply any scalar function to a column in the DataFrame.

Parameters:

- df: pandas DataFrame
- field: The column where the result will be stored.
- fname: The name of the function to apply.
- column: The column to which the function is applied (if None, apply to field column).
- **kwargs: Additional arguments to pass to the function.

bytesio_to_base64

bytesio_to_base64(df, field, column, as_string=False, as_image=True, image_mime='image/png')

Converts bytes in a DataFrame column to a Base64 encoded string.

:param df: The DataFrame containing the bytes column.
:param field: The name of the field to store the Base64 encoded string.
:param column: The name of the bytes column.
:param as_string: If True, converts the Base64 bytes to a string.
:return: The DataFrame with the Base64 encoded string.
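
As an illustration of the encoding step only, the sketch below uses the standard `base64` module. The helper `encode_bytes` and the column names are hypothetical, and the `as_image` and `image_mime` options from the signature are not covered because their behavior is not documented here.

```python
import base64
from io import BytesIO

import pandas as pd


def encode_bytes(value, as_string=True):
    """Hypothetical helper: Base64-encode raw bytes or the contents of a BytesIO object."""
    raw = value.getvalue() if isinstance(value, BytesIO) else value
    encoded = base64.b64encode(raw)
    # b64encode returns bytes; decode to str only when asked to.
    return encoded.decode("ascii") if as_string else encoded


images = pd.DataFrame({"image": [BytesIO(b"hello"), b"hello"]})
images["image_b64"] = images["image"].apply(encode_bytes)
```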

calculate_distance

calculate_distance(df, field, columns, unit='km', chunk_size=1000)

Add a distance column to a dataframe.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | pandas DataFrame with columns 'latitude', 'longitude', 'store_lat', 'store_lng' | required |
| columns | List[tuple] | List of tuples with column names for coordinates. First tuple: [latitude1, longitude1]; second tuple: [latitude2, longitude2]. | required |
| unit | str | Unit of distance ('km' for kilometers, 'm' for meters, 'mi' for miles) | 'km' |
| chunk_size | int | Number of rows to process at once for large datasets | 1000 |

Returns:

| Type | Description |
|------|-------------|
| DataFrame | df with additional 'distance_km' column |
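
A vectorised distance column could be computed along these lines. This is a sketch under assumptions, not the library's code: the helper `add_distance_km` is hypothetical, it uses the haversine formula with an assumed mean Earth radius, it returns kilometers only, and it does not chunk the input.

```python
import numpy as np
import pandas as pd


def add_distance_km(df, field, columns):
    """Hypothetical helper: haversine distance between two coordinate pairs, in km."""
    (lat1_col, lon1_col), (lat2_col, lon2_col) = columns
    # Convert all four coordinate columns from degrees to radians.
    lat1, lon1, lat2, lon2 = (
        np.radians(df[name].to_numpy(dtype=float))
        for name in (lat1_col, lon1_col, lat2_col, lon2_col)
    )
    a = (
        np.sin((lat2 - lat1) / 2) ** 2
        + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2
    )
    # 6371.0088 km is the mean Earth radius (an assumed constant).
    df[field] = 2 * 6371.0088 * np.arcsin(np.sqrt(a))
    return df


stores = pd.DataFrame(
    {"latitude": [0.0], "longitude": [0.0], "store_lat": [0.0], "store_lng": [1.0]}
)
stores = add_distance_km(
    stores, "distance_km", [("latitude", "longitude"), ("store_lat", "store_lng")]
)
```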

convert_timezone

convert_timezone(df, field, *, column=None, from_tz='UTC', to_tz=None, tz_column=None, default_timezone='UTC')

Convert field to a target time zone.

Parameters

df : DataFrame
field : name of an existing datetime column
column : name of the output column (defaults to field)
from_tz : timezone used to localise naive timestamps
to_tz : target timezone (ignored if tz_column is given)
tz_column : optional column that contains a timezone per row
default_timezone : fallback when a row's tz_column is null/NaN

Returns:

| Type | Description |
|------|-------------|
| DataFrame | df with converted datetime column |
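
The localise-then-convert flow can be sketched with the pandas `.dt` accessor. The helper `to_timezone` and the target zone are hypothetical choices for illustration; per-row zones via `tz_column` are not covered.

```python
import pandas as pd


def to_timezone(df, field, from_tz="UTC", to_tz="America/New_York"):
    """Hypothetical helper: localise naive timestamps, then convert them to another zone."""
    series = pd.to_datetime(df[field])
    if series.dt.tz is None:
        # Naive timestamps carry no zone, so attach the source zone first.
        series = series.dt.tz_localize(from_tz)
    df[field] = series.dt.tz_convert(to_tz)
    return df


events = pd.DataFrame({"updated": ["2024-01-15 12:00:00"]})
events = to_timezone(events, "updated")
```

Localising attaches a zone without shifting the clock time, while converting shifts the clock time to the target zone.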

create_attachment_column

create_attachment_column(df, field, columns, colnames=None)

Create a column with a list of attachments from one or more path/URL columns.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | Input DataFrame. | required |
| field | str | Name of the new column to store the list of attachments. | required |
| columns | List[str] | Column names to convert. You can pass either the exact column (e.g., "pdf_path_m0") or the base name (e.g., "pdf_path"). | required |
| colnames | Optional[Dict[str, str]] | Optional list of names for the attachments. If not provided, the column names will be used as names. | None |

Returns:

| Type | Description |
|------|-------------|
| DataFrame | The same DataFrame with field added. |

day_of_week

day_of_week(df, field, column, locale='en_US.utf8')

Extracts the day of the week from a date column.

:param df: The DataFrame containing the date column.
:param field: The name of the field to store the day of the week.
:param column: The name of the date column.
:return: The DataFrame with the day of the week.
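
In plain pandas, a day-of-week column can be derived with `Series.dt.day_name()`. The sketch below uses hypothetical column names and does not pass a locale, so whether the library's `locale` argument maps to this call is an assumption.

```python
import pandas as pd

visits = pd.DataFrame({"visit_date": pd.to_datetime(["2024-01-15", "2024-01-20"])})
# Without a locale argument, day_name() returns English day names.
visits["weekday"] = visits["visit_date"].dt.day_name()
```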

drop_timezone

drop_timezone(df, field, column=None)

Drop the timezone information from a datetime column.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | pandas DataFrame with a datetime column | required |
| field | str | name of the datetime column | required |

Returns:

| Type | Description |
|------|-------------|
| DataFrame | df with timezone-free datetime column |
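
Dropping the zone from a timezone-aware column can be sketched with `tz_localize(None)`. The column name is hypothetical, and this shows the pandas idiom rather than the library's exact behavior.

```python
import pandas as pd

stamps = pd.DataFrame(
    {"updated": pd.to_datetime(["2024-01-15 12:00:00"]).tz_localize("UTC")}
)
# tz_localize(None) removes the zone and keeps the wall-clock time unchanged.
stamps["updated"] = stamps["updated"].dt.tz_localize(None)
```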

duration

duration(df, field, columns, unit='s')

Converts a duration column to a specified unit.

:param df: The DataFrame containing the duration column.
:param field: The name of the field to store the converted duration.
:param columns: The name(s) of the duration column(s).
:param unit: The unit to convert the duration to.
:return: The DataFrame with the converted duration.

extract_from_dictionary

extract_from_dictionary(df, field, column, key, conditions=None, as_timestamp=False)

Extracts a value from a JSON column in the DataFrame.

:param df: The DataFrame containing the JSON column.
:param field: The name of the field to store the extracted value.
:param column: The name of the JSON column.
:param key: The key to extract from the JSON object.
:param conditions: Optional dictionary of conditions to filter rows before extraction.
:param as_timestamp: If True, converts the extracted value to a timestamp.
:return: The DataFrame with the extracted value.
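
The core extraction step can be sketched with `Series.apply` and `dict.get`. The column and key names are hypothetical, and the `conditions` and `as_timestamp` options are left out because their exact semantics are not described here.

```python
import pandas as pd

orders = pd.DataFrame({"payload": [{"status": "open", "total": 10}, {"total": 5}, None]})
# Rows whose value is not a dict, or that lack the key, yield a missing value.
orders["status"] = orders["payload"].apply(
    lambda item: item.get("status") if isinstance(item, dict) else None
)
```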

extract_from_object

extract_from_object(df, field, column, key, as_string=False, as_timestamp=False)

Extracts a value from an object column in the DataFrame.

:param df: The DataFrame containing the object column.
:param field: The name of the field to store the extracted value.
:param column: The name of the object column.
:param key: The key to extract from the object.
:param as_string: If True, converts the extracted value to a string.
:param as_timestamp: If True, converts the extracted value to a timestamp.
:return: The DataFrame with the extracted value.

fully_geoloc

fully_geoloc(df, field, columns, inverse=False)

Adds a boolean column (named field) to df that is True when, for each tuple in columns, all the involved columns are neither NaN nor empty.

Parameters:

| Name | Type | Description | Default |
|------|------|-------------|---------|
| df | DataFrame | The DataFrame. | required |
| field | str | The name of the output column. | required |
| columns | list of tuple of str | List of tuples, where each tuple contains column names that must be valid (non-null and non-empty). Example: [("start_lat", "start_long"), ("end_lat", "end_long")] | required |

Returns:

| Type | Description |
|------|-------------|
| DataFrame | The original DataFrame with the new field column. |
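
The "every tuple fully populated" rule can be sketched as below. The helper `all_groups_complete` and the sample data are hypothetical, "empty" is assumed to mean an empty string, and the `inverse` flag is not modeled because its behavior is not documented here.

```python
import pandas as pd


def all_groups_complete(df, field, columns):
    """Hypothetical sketch: True only when every tuple of columns is fully populated."""
    result = pd.Series(True, index=df.index)
    for group in columns:
        subset = df[list(group)]
        # A value counts as valid when it is not NaN/None and not an empty string.
        valid = subset.notna() & subset.ne("")
        # One incomplete tuple is enough to mark the row False.
        result &= valid.all(axis=1)
    df[field] = result
    return df


routes = pd.DataFrame(
    {
        "start_lat": [10.0, 10.0],
        "start_long": [20.0, 20.0],
        "end_lat": [30.0, None],
        "end_long": [40.0, ""],
    }
)
routes = all_groups_complete(
    routes, "geolocated", [("start_lat", "start_long"), ("end_lat", "end_long")]
)
```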

get_moment

get_moment(df, field, column, moments=None)

- df: pandas DataFrame
- column: name of the column to compare (e.g. "updated_hour")
- moments: list of tuples [(label, (start, end)), ...], e.g. [("night", (0, 7)), ("morning", (7, 10)), ...]
- returns: a Series of labels corresponding to each row
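
Range-based labelling of this kind can be sketched as follows. The helper `label_moment` is hypothetical, and treating each range as half-open (start <= value < end) is an assumption, since the source does not state how boundaries are handled.

```python
import pandas as pd


def label_moment(df, field, column, moments):
    """Hypothetical helper: label each row by the first range that contains its value."""

    def pick(value):
        for label, (start, end) in moments:
            # Assumption: ranges are half-open, start <= value < end.
            if start <= value < end:
                return label
        # No range matched.
        return None

    df[field] = df[column].map(pick)
    return df


hours = pd.DataFrame({"updated_hour": [3, 8, 23]})
ranges = [("night", (0, 7)), ("morning", (7, 10))]
hours = label_moment(hours, "moment", "updated_hour", ranges)
```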

get_product

get_product(row, field, columns)

Retrieves product information from the Barcode Lookup API based on a barcode.

:param row: The DataFrame row containing the barcode.
:param field: The name of the field containing the barcode.
:param columns: The list of columns to extract from the API response.
:return: The DataFrame row with the product information.

haversine_distance

haversine_distance(lat1, lon1, lat2, lon2, unit='km')

Distance between two points on Earth, in kilometers by default (see the unit parameter).
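
For reference, the haversine formula itself can be written with the standard library alone. This is a generic sketch of the formula, not the library's code: the function name `haversine_km` and the Earth radius constant are assumptions, and only kilometers are returned.

```python
import math

# Mean Earth radius in km; an assumed constant, the library's value is not shown.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (
        math.sin(dphi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude along the equator is roughly 111.2 km.
equator_degree = haversine_km(0.0, 0.0, 0.0, 1.0)
```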

path_to_url

path_to_url(df, field, column=None, base_path='files/', base_url='https://example.com/files/')

Converts a file path in a DataFrame column to a URL. Replaces the base path with the base URL.

:param df: The DataFrame containing the file path column.
:param field: The name of the field to store the URL.
:param column: The name of the file path column (defaults to field).
:param base_path: The base path to replace in the file path.
:param base_url: The base URL to use for the conversion.
:return: The DataFrame with the URL in the specified field.
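
The base-path substitution can be sketched with `Series.str.replace`. The column names are hypothetical, and replacing only the first literal occurrence is an assumption about the intended behavior.

```python
import pandas as pd

docs = pd.DataFrame({"pdf_path": ["files/reports/a.pdf"]})
base_path, base_url = "files/", "https://example.com/files/"
# Literal (non-regex) replacement of the first occurrence of the base path.
docs["pdf_url"] = docs["pdf_path"].str.replace(base_path, base_url, n=1, regex=False)
```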

string_to_vector

string_to_vector(df, field)

Converts a string representation of a list into an actual list.

:param df: The DataFrame containing the string representation.
:param field: The name of the field to convert.
:return: The DataFrame with the converted field.
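
One common way to perform this conversion is `ast.literal_eval`. The sketch below assumes the strings are valid Python list literals; the column name is hypothetical and the library may parse differently.

```python
import ast

import pandas as pd

vectors = pd.DataFrame({"embedding": ["[0.1, 0.2, 0.3]", "[1, 2]"]})
# literal_eval parses Python literals only, so it is safer than eval() on untrusted text.
vectors["embedding"] = vectors["embedding"].apply(ast.literal_eval)
```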

upc_to_product

upc_to_product(df, field, columns=['barcode_formats', 'mpn', 'asin', 'title', 'category', 'model', 'brand'])

Converts UPC codes in a DataFrame to product information using the Barcode Lookup API.

:param df: The DataFrame containing the UPC codes.
:param field: The name of the field containing the UPC codes.
:param columns: The list of columns to extract from the API response.
:return: The DataFrame with the product information.