Jobs
****

Jobs are the kind of things that *urlwatch(1)* can monitor.

The list of jobs to run is stored in the configuration file
"urls.yaml", which can be edited with the command "urlwatch --edit".
Jobs are separated from each other by a line containing only "---".
The command "urlwatch --list" prints the name of each job along with
its index number (1, 2, 3, …), which is assigned automatically
according to its position in the configuration file.

While optional, it is recommended that each job starts with a "name"
entry:

   name: "This is a human-readable name/label of the job"
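
As a minimal sketch of how several jobs might sit in the same file,
the following combines two jobs separated by a "---" line. The second
job (its name and command) is made up for illustration:

   name: "urlwatch homepage"
   url: "https://thp.io/2008/urlwatch/"
   ---
   name: "Disk usage"
   command: "df -h"

With such a file, "urlwatch --list" would be expected to list the
first job with index 1 and the second with index 2, following their
order in the file.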

The following job types are available:


URL
===

This is the main job type – it retrieves a document from a web server:

   name: "urlwatch homepage"
   url: "https://thp.io/2008/urlwatch/"

Required keys:

* "url": The URL to the document to watch for changes

Job-specific optional keys:

* "cookies": Cookies to send with the request (see Advanced Topics)

* "method": HTTP method to use (default: "GET")

* "data": HTTP POST/PUT data

* "ssl_no_verify": Do not verify SSL certificates (true/false)

* "ignore_cached": Do not use cache control (ETag/Last-Modified)
  values (true/false)

* "http_proxy": Proxy server to use for HTTP requests

* "https_proxy": Proxy server to use for HTTPS requests

* "headers": HTTP headers to send along with the request

* "encoding": Override the character encoding from the server (see
  Advanced Topics)

* "timeout": Override the default socket timeout (see Advanced Topics)

* "ignore_connection_errors": Ignore (temporary) connection errors
  (see Advanced Topics)

* "ignore_http_error_codes": List of HTTP errors to ignore (see
  Advanced Topics)

* "ignore_timeout_errors": Do not report errors when the timeout is
  hit

* "ignore_too_many_redirects": Ignore redirect loops (see Advanced
  Topics)
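
To illustrate how some of these optional keys combine, here is a
hedged sketch of a job that sends a POST request. The URL, form data
and header value are placeholders, and the value formats shown (a
string for "data", a mapping for "headers", seconds for "timeout", a
comma-separated list of codes for "ignore_http_error_codes") are
assumptions; Advanced Topics describes the accepted formats:

   name: "Search results (POST)"
   url: "https://example.org/search"
   method: "POST"
   # Placeholder form data, sent as the request body
   data: "q=urlwatch&category=4"
   headers:
     User-Agent: "my-urlwatch-job"
   # Assumed to be in seconds
   timeout: 30
   ignore_http_error_codes: "4xx, 5xx"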

(Note: "url" implies "kind: url")


Browser
=======

This job type is a resource-intensive variant of “URL” to handle web
pages requiring JavaScript in order to render the content to be
monitored.

The optional "pyppeteer" package must be installed to run “Browser”
jobs (see Dependencies).

At the moment, the Chromium version used by "pyppeteer" only supports
macOS (x86_64), Windows (both x86 and x64) and Linux (x86_64). See
the Pyppeteer issue tracker for progress on supporting ARM devices
(e.g. Raspberry Pi).

Because "pyppeteer" downloads a special version of Chromium (~ 100
MiB), the first execution of a "browser" job could take some time (and
bandwidth). It is possible to run "pyppeteer-install" to pre-download
Chromium.

   name: "A page with JavaScript"
   navigate: "https://example.org/"

Required keys:

* "navigate": URL to navigate to with the browser

Job-specific optional keys:

* "wait_until": Either "load", "domcontentloaded", "networkidle0", or
  "networkidle2" (see Advanced Topics)

Because this job uses Pyppeteer to render the page in a headless
Chromium instance, it requires far more resources than a “URL” job.
Use it only for pages where a “URL” job does not give the right
results.

Hint: in many cases, instead of using a “Browser” job, you can use the
much faster “URL” job type to monitor the output of an API that the
site calls during page loading and that contains the information
you’re after.

(Note: "navigate" implies "kind: browser")
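
As a small sketch of the "wait_until" key, the job below asks the
browser to wait until network activity has settled before the content
is captured. The job name is made up, and which of the four values
works best depends on the page being monitored:

   name: "A page that loads its content late"
   navigate: "https://example.org/"
   wait_until: networkidle0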


Shell
=====

This job type allows you to watch the output of arbitrary shell
commands, which is useful for e.g. monitoring an FTP upload folder or
the output of scripts that query external devices (RPi GPIO).

   name: "What is in my Home Directory?"
   command: "ls -al ~"

Required keys:

* "command": The shell command to execute

Job-specific optional keys:

* "stderr": Change how standard error is treated (see below)

(Note: "command" implies "kind: shell")


Configuring "stderr" behavior for shell jobs
--------------------------------------------

By default, urlwatch captures "stderr" and uses it for error reporting
when the shell job exits with a non-zero exit code, but ignores it
when the job exits with exit code 0.

This behavior can be customized using the "stderr" key:

* "ignore": Capture "stderr", report on non-zero exit code, ignore
  otherwise (default)

* "urlwatch": "stderr" of the shell job is sent to "stderr" of the
  "urlwatch" process; any error message on "stderr" will not be
  visible in the error message from the reporter (legacy default
  behavior of urlwatch 2.24 and older)

* "fail": Treat the job as failed if there is *any* output on
  "stderr", even with exit status 0

* "stdout": Merge "stderr" output into "stdout", which means stderr
  output is also considered for the change detection/diff part of
  urlwatch (this is similar to "2>&1" in a shell)

For example, this job definition will make the job appear as failed,
even though the script exits with exit code 0:

   command: |
     echo "Normal standard output."
     echo "Something goes to stderr, which makes this job fail." 1>&2
     exit 0
   stderr: fail

On the other hand, if you want to diff both stdout and stderr of the
job, use this:

   command: |
     echo "An important line on stdout."
     echo "Another important line on stderr." 1>&2
   stderr: stdout


Optional keys for all job types
===============================

* "name": Human-readable name/label of the job

* "filter": Filters (if any) to apply to the output (can be tested
  with "--test-filter")

* "max_tries": Number of times to retry fetching the resource

* "diff_tool": Command for a custom tool that generates the diff text

* "diff_filter": Filters (if any) to apply to the diff result (can be
  tested with "--test-diff-filter")

* "treat_new_as_changed": Treat jobs that don’t have any historic data
  as "CHANGED" instead of "NEW" (and create a diff for new jobs)

* "compared_versions": Number of versions to compare for similarity

* "kind" (redundant): Either "url", "shell" or "browser".
  Automatically derived from the unique key ("url", "command" or
  "navigate") of the job type

* "user_visible_url": Different URL to show in reports (e.g. when
  watched URL is a REST API URL, and you want to show a webpage)
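
The following sketch shows a few of these keys on a single job. Both
URLs are placeholders, and the retry count is an arbitrary choice for
illustration:

   name: "Service status"
   # The job fetches the API endpoint ...
   url: "https://example.org/api/v1/status"
   # ... but reports link to the human-readable page
   user_visible_url: "https://example.org/status"
   max_tries: 3
   treat_new_as_changed: true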


Setting keys for all jobs at once
=================================

The main Configuration file has a "job_defaults" key that can be used
to configure keys for all jobs at once.
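
As a rough sketch, such a section might look like the following. The
grouping into "all", "url" and "shell" sub-keys and the chosen values
are assumptions made for illustration; the Configuration documentation
is the authoritative description of the layout:

   job_defaults:
     # Assumed to apply to every job, whatever its kind
     all:
       max_tries: 3
     # Assumed to apply only to "URL" jobs
     url:
       ignore_connection_errors: true
     # Assumed to apply only to "Shell" jobs
     shell:
       stderr: fail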
