Output Data Format
The output data consists of zero or more matched page sets, one or more matched pages within each matched set, and the N keys that are defined in the Scraping Definition.
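As a rough illustration of that shape, the sketch below models the output as a mapping from page set name to a list of page records, where each record is expected to carry the keys declared for its set. The names `SCRAPING_KEYS`, `output`, and `missing_keys`, along with the example sets, keys, and URLs, are hypothetical and chosen only for illustration; they are not taken from the specification.

```python
# Hypothetical keys per page set, as a Scraping Definition might declare them.
SCRAPING_KEYS = {
    "articles": ["title", "author"],
    "products": ["name", "price", "sku"],
}

# Hypothetical output: page set name -> list of matched pages.
output = {
    "articles": [
        {"url": "https://example.com/a/1", "title": "Hello", "author": "Ann"},
        {"url": "https://example.com/a/2", "title": "World", "author": "Bob"},
    ],
    "products": [
        {"url": "https://example.com/p/1", "name": "Lamp", "price": "9.99"},
    ],
}


def missing_keys(output, scraping_keys):
    """Return (page_set, url, key) triples for keys a page failed to provide."""
    problems = []
    for page_set, pages in output.items():
        for page in pages:
            for key in scraping_keys.get(page_set, []):
                if key not in page:
                    problems.append((page_set, page.get("url"), key))
    return problems
```

Here the single product page lacks the `sku` key, so `missing_keys(output, SCRAPING_KEYS)` reports one problem for it.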
JSON Output
The following is a minimalist output format, but it loses certain information.
We highly recommend enriching the data with semantic information. All implementations SHOULD parse individual page scraping definitions and output data, and SHOULD provide warnings if the data is not a properly structured JSON-LD document.
Additionally, production environments SHOULD throw an error if the result is not properly structured.
This allows each page to be consumed by a system that can easily understand its context and place it directly into a knowledge graph.
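The warn-or-error behavior described above could be sketched as follows. This is only a loose heuristic, not a full JSON-LD validation: it assumes a document counts as "properly structured" when it is a JSON object with a top-level `@context`, so valid expanded-form or array-form JSON-LD would be flagged. The function name `check_jsonld` and the `production` flag are hypothetical.

```python
import json
import warnings


def check_jsonld(document, production=False):
    """Loosely check that a parsed JSON value looks like a JSON-LD document.

    Assumption: only a top-level object with "@context" is accepted.
    Warns by default; raises ValueError when production=True.
    """
    problems = []
    if not isinstance(document, dict):
        problems.append("top-level value is not a JSON object")
    elif "@context" not in document:
        problems.append('missing "@context"')

    if problems:
        message = "output is not structured JSON-LD: " + "; ".join(problems)
        if production:
            raise ValueError(message)
        warnings.warn(message)
    return not problems


# Usage: a page record that carries a context passes the check.
page = json.loads('{"@context": "https://schema.org", "@type": "Article"}')
print(check_jsonld(page))
```

A stricter implementation would likely hand the document to a real JSON-LD processor instead of inspecting keys by hand.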
Example
Let's say our scraper retrieved this data:
Data Size and Visualization
This results in a total data size of:
$$\sum_{i=1}^{S} K_i \cdot P_i$$
where $S$ is the number of Page Sets, $K_i$ is the number of Relevant Data attributes for set $i$, and $P_i$ is the number of pages for that set.
Generally, for a static Scraping Definition, only $P_i$ is subject to change, based on new pages that may be posted and match a page set, or on different slices of an offline dataset that the scraper looks at.
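To make the size estimate concrete, here is a small sketch that assumes the total is counted in individual data values, that is, the sum over page sets of attributes per set times pages per set. The function name `total_data_size` and all counts below are hypothetical.

```python
def total_data_size(page_sets):
    """Sum attributes-per-set times pages-per-set over all page sets."""
    return sum(attrs * pages for attrs, pages in page_sets)


# Hypothetical (attributes, pages) counts for three page sets.
page_sets = [(4, 100), (2, 250), (6, 10)]
print(total_data_size(page_sets))  # 4*100 + 2*250 + 6*10 = 960

# With a static definition only the page counts move: 50 new pages
# matching the first set add 4 * 50 = 200 values.
page_sets[0] = (4, 150)
print(total_data_size(page_sets))  # 1160
```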