ProductInfo

flowtask.components.ProductInfo

parsers

base

ParserBase
ParserBase(*args, **kwargs)

Bases: HTTPService, SeleniumService

Base class for product information parsers.

Defines the interface and common functionality for all product parsers.

create_search_query
create_search_query(term)

Create a search query for the given term.

Parameters:

Name | Type | Description | Default
term | str | Search term (typically product model) | required

Returns:

Type | Description
str | Formatted search query
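
As a rough illustration of what a query builder of this kind might do, the sketch below normalizes the term and restricts the search to one site. The `site:` operator, the `domain` parameter, and the whitespace handling are assumptions made for the example; the page does not say how the real method formats its query.

```python
def create_search_query(term: str, domain: str = "example.com") -> str:
    """Build a site-restricted query string for a product model (illustrative only)."""
    # Collapse repeated whitespace and trim the ends of the search term.
    cleaned = " ".join(term.split())
    # Hypothetical format: restrict results to a single manufacturer domain.
    return f"site:{domain} {cleaned}"
```

A subclass could override such a method to change the domain or add extra keywords, which is what the Canon parser's region-specific variant suggests.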

extract_model_code
extract_model_code(url)

Extract model code from URL using the regex pattern if defined.

Parameters:

Name | Type | Description | Default
url | str | URL to extract model code from | required

Returns:

Type | Description
Optional[str] | Extracted model code or None if not found or pattern not defined
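
The behaviour described above (return the match, or None when there is no match or no pattern) can be sketched with the standard `re` module. The pattern and the URL layout below are hypothetical; each real parser would supply its own pattern, if any.

```python
import re
from typing import Optional

# Hypothetical pattern: assumes product URLs look like ".../p/<MODEL>".
MODEL_PATTERN: Optional[re.Pattern] = re.compile(r"/p/([A-Za-z0-9-]+)")


def extract_model_code(url: str, pattern: Optional[re.Pattern] = MODEL_PATTERN) -> Optional[str]:
    """Return the first capture group of `pattern` found in `url`, else None."""
    if pattern is None:
        # No pattern defined for this parser.
        return None
    match = pattern.search(url)
    return match.group(1) if match else None
```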

get_product_urls
get_product_urls(search_results, max_urls=5)

Extract relevant product URLs from search results.

Parameters:

Name | Type | Description | Default
search_results | List[Dict[str, str]] | List of search result dictionaries | required
max_urls | int | Maximum number of URLs to return | 5

Returns:

Type | Description
List[str] | List of product URLs
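
A minimal sketch of this kind of filtering follows. It assumes each search result stores its URL under a `"link"` key and that relevance is decided by a simple domain check; both are assumptions, since the page does not describe the result schema or the matching rule.

```python
from typing import Dict, List


def get_product_urls(
    search_results: List[Dict[str, str]],
    max_urls: int = 5,
    domain: str = "example.com",  # hypothetical relevance filter
) -> List[str]:
    """Collect up to `max_urls` unique URLs that contain `domain`."""
    urls: List[str] = []
    for result in search_results:
        if len(urls) >= max_urls:
            break
        url = result.get("link", "")  # the "link" key is an assumption
        if domain in url and url not in urls:
            urls.append(url)
    return urls
```

The manufacturer-specific parsers override this method, presumably to apply their own URL pattern in place of the plain domain check used here.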

parse abstractmethod async
parse(url, search_term)

Parse product information from a URL.

Parameters:

Name | Type | Description | Default
url | str | URL to parse | required
search_term | str | Original search term | required

Returns:

Type | Description
Dict[str, Any] | Dictionary with extracted product information
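
Because `parse` is both abstract and asynchronous, every concrete parser has to provide its own coroutine. The sketch below shows that pattern with stand-in classes (`SketchParserBase` and `DummyParser` are hypothetical names, not part of flowtask) so that it runs without the library installed.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict


class SketchParserBase(ABC):
    """Stand-in for a parser base class with an abstract async `parse`."""

    @abstractmethod
    async def parse(self, url: str, search_term: str) -> Dict[str, Any]:
        ...


class DummyParser(SketchParserBase):
    async def parse(self, url: str, search_term: str) -> Dict[str, Any]:
        # A real parser would fetch and scrape the page here.
        return {"url": url, "search_term": search_term}


# Drive the coroutine from synchronous code.
result = asyncio.run(DummyParser().parse("https://example.com/p/X1", "X1"))
```

Instantiating the base class directly raises `TypeError`, which is how Python enforces that subclasses implement `parse`.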

brother

BrotherParser
BrotherParser(*args, **kwargs)

Bases: ParserBase

Parser for Brother product information.

Extracts product details from Brother's USA website using Selenium.

get_product_urls
get_product_urls(search_results, max_urls=5)

Extract relevant product URLs from search results.

Parameters:

Name | Type | Description | Default
search_results | List[Dict[str, str]] | List of search result dictionaries | required
max_urls | int | Maximum number of URLs to return | 5

Returns:

Type | Description
List[str] | List of product URLs that match the Brother product pattern

parse async
parse(url, search_term, retailer=None)

Parse product information from a Brother URL using Selenium.

Parameters:

Name | Type | Description | Default
url | str | Brother product URL | required
search_term | str | Original search term | required
retailer | Optional[str] | Optional retailer information (not used for Brother) | None

Returns:

Type | Description
Dict[str, Any] | Dictionary with product information

canon

CanonParser
CanonParser(*args, **kwargs)

Bases: ParserBase

Parser for Canon product information.

Extracts product details from Canon's USA and Canada websites using Selenium.

create_search_query
create_search_query(term)

Create region-specific search query.

Parameters:

Name | Type | Description | Default
term | str | Search term (typically product model) | required

Returns:

Type | Description
str | Formatted search query for the appropriate region

determine_region
determine_region(retailer)

Determine region based on retailer information.

Parameters:

Name | Type | Description | Default
retailer | Optional[str] | Retailer string that may contain region information | required

Returns:

Type | Description
str | 'ca' for Canada, 'us' for United States (default)
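
One plausible way to map a retailer string to a region code is sketched below. The matching rule (a case-insensitive search for "canada") is an assumption; the page only specifies the two return values and that 'us' is the default.

```python
from typing import Optional


def determine_region(retailer: Optional[str]) -> str:
    """Return 'ca' when the retailer string mentions Canada, otherwise 'us'."""
    # Hypothetical rule: look for the word "canada" anywhere in the string.
    if retailer and "canada" in retailer.lower():
        return "ca"
    return "us"
```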

get_product_urls
get_product_urls(search_results, max_urls=5)

Extract relevant product URLs from search results.

Parameters:

Name | Type | Description | Default
search_results | List[Dict[str, str]] | List of search result dictionaries | required
max_urls | int | Maximum number of URLs to return | 5

Returns:

Type | Description
List[str] | List of product URLs that match the Canon product pattern

parse async
parse(url, search_term, retailer=None)

Parse product information from a Canon URL using Selenium.

Parameters:

Name | Type | Description | Default
url | str | Canon product URL | required
search_term | str | Original search term | required
retailer | Optional[str] | Optional retailer information to determine region | None

Returns:

Type | Description
Dict[str, Any] | Dictionary with product information

epson

EpsonParser
EpsonParser(*args, **kwargs)

Bases: ParserBase

Parser for Epson product information.

Extracts product details from Epson's website.

extract_model_code
extract_model_code(url)

Extract model code from URL using the regex pattern and clean it.

Parameters:

Name | Type | Description | Default
url | str | URL to extract model code from | required

Returns:

Type | Description
Optional[str] | Cleaned model code or None if not found

parse async
parse(url, search_term, retailer=None)

Parse product information from an Epson URL.

Parameters:

Name | Type | Description | Default
url | str | Epson product URL | required
search_term | str | Original search term | required
retailer | Optional[str] | Optional retailer information | None

Returns:

Type | Description
Dict[str, Any] | Dictionary with product information

hp

HPParser
HPParser(*args, **kwargs)

Bases: ParserBase

Parser for HP product information.

Extracts product details from HP's website using Selenium for dynamic content.

get_product_urls
get_product_urls(search_results, max_urls=5)

Extract relevant product URLs from search results.

Parameters:

Name | Type | Description | Default
search_results | List[Dict[str, str]] | List of search result dictionaries | required
max_urls | int | Maximum number of URLs to return | 5

Returns:

Type | Description
List[str] | List of product URLs that match the HP product pattern

parse async
parse(url, search_term, retailer=None)

Parse product information from an HP URL using Selenium.

Parameters:

Name | Type | Description | Default
url | str | HP product URL | required
search_term | str | Original search term | required
retailer | Optional[str] | Optional retailer information | None

Returns:

Type | Description
Dict[str, Any] | Dictionary with product information

samsung

SamsungParser
SamsungParser(*args, **kwargs)

Bases: ParserBase

Parser for Samsung product information.

Extracts product details from Samsung's website using Selenium.

get_product_urls
get_product_urls(search_results, max_urls=1)

Extract relevant product URLs from search results.

Parameters:

Name | Type | Description | Default
search_results | List[Dict[str, str]] | List of search result dictionaries | required
max_urls | int | Maximum number of URLs to return (default: 1) | 1

Returns:

Type | Description
List[str] | List of product URLs that match the Samsung product pattern

parse async
parse(url, search_term, retailer=None)

Parse product information from a Samsung URL using Selenium.

Parameters:

Name | Type | Description | Default
url | str | Samsung product URL | required
search_term | str | Original search term | required
retailer | Optional[str] | Optional retailer information (not used for Samsung) | None

Returns:

Type | Description
Dict[str, Any] | Dictionary with product information

scraper

ProductInfo

ProductInfo(loop=None, job=None, stat=None, **kwargs)

Bases: FlowComponent, HTTPService, SeleniumService

Product Information Scraper Component

This component extracts detailed product information by:

1. Searching for products using search terms
2. Extracting model codes from URLs
3. Parsing product details from manufacturer websites

Configuration options:

- search_column: Column name containing search terms (default: 'model')
- parsers: List of parser names to use (default: ['epson'])
- max_results: Maximum number of search results to process (default: 5)
- concurrently: Process items concurrently (default: True)
- task_parts: Number of parts to split concurrent tasks (default: 10)
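
The documented defaults can be written down as a plain mapping, which makes it easy to see what a caller overrides. The `resolve_options` helper below is hypothetical and only illustrates merging overrides onto those defaults; it is not how the component itself reads its configuration.

```python
from typing import Any, Dict

# Defaults exactly as listed in the configuration options above.
DEFAULTS: Dict[str, Any] = {
    "search_column": "model",
    "parsers": ["epson"],
    "max_results": 5,
    "concurrently": True,
    "task_parts": 10,
}


def resolve_options(**overrides: Any) -> Dict[str, Any]:
    """Merge caller overrides onto the defaults, rejecting unknown keys."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


options = resolve_options(parsers=["hp", "canon"], max_results=3)
```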

close async
close()

Clean up resources.

run async
run()

Execute product info extraction for each row.
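
To illustrate what per-row extraction with bounded concurrency can look like, the sketch below processes search terms in batches using `asyncio.gather`. The names `process_row` and `run_in_batches` and the batching strategy are assumptions for the example; the component's actual `run` logic is not shown on this page.

```python
import asyncio
from typing import Any, Dict, List


async def process_row(term: str) -> Dict[str, Any]:
    """Placeholder for searching and parsing one product."""
    await asyncio.sleep(0)  # stands in for network or browser work
    return {"model": term, "status": "ok"}


async def run_in_batches(terms: List[str], batch_size: int = 2) -> List[Dict[str, Any]]:
    """Process terms batch by batch; each batch runs concurrently."""
    results: List[Dict[str, Any]] = []
    for start in range(0, len(terms), batch_size):
        batch = terms[start:start + batch_size]
        # gather returns results in the same order as its inputs.
        results.extend(await asyncio.gather(*(process_row(t) for t in batch)))
    return results


results = asyncio.run(run_in_batches(["A1", "B2", "C3"]))
```

Batching keeps the number of simultaneous requests or browser sessions bounded, which matters when each task drives Selenium.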

split_parts
split_parts(tasks, num_parts=5)

Split tasks into parts for concurrent processing.
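
A minimal sketch of such a split, assuming `num_parts` means the number of chunks (not the chunk size) and that chunks should be contiguous and nearly equal in length. The real method may divide tasks differently.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_parts(tasks: Sequence[T], num_parts: int = 5) -> List[List[T]]:
    """Split `tasks` into at most `num_parts` contiguous, nearly equal chunks."""
    if num_parts < 1:
        raise ValueError("num_parts must be at least 1")
    size, extra = divmod(len(tasks), num_parts)
    parts: List[List[T]] = []
    start = 0
    for i in range(num_parts):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when there are few tasks
            parts.append(list(tasks[start:end]))
        start = end
    return parts
```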

start async
start(**kwargs)

Initialize component and validate requirements.