1 Introduction
==============

This chapter is under construction!

   This chapter describes some of the internals of ‘vasm’ and tries to
explain what has to be done to write a cpu module, a syntax module or an
output module for ‘vasm’.  However, if you want to write one, I suggest
contacting me first, so that it can be integrated into the source tree.

   Note that this documentation may mention explicit values when
introducing symbolic constants.  This is due to copying and pasting from
the source code.  These values may not be up to date and in some cases
can be overridden.  Therefore never use the absolute values; always use
the symbolic representations.

2 Building vasm
===============

This section deals with the steps necessary to build the typical ‘vasm’
executable from the sources.

2.1 Directory Structure
-----------------------

The vasm-directory contains the following important files and
directories:
‘vasm/’
     The main directory containing the assembler sources.

‘vasm/Makefile’
     The Makefile used to build ‘vasm’.

‘vasm/syntax/<syntax-module>/’
     Directories for the syntax modules.

‘vasm/cpus/<cpu-module>/’
     Directories for the cpu modules.

‘vasm/obj/’
     Directory the object modules will be stored in.

   All compiling is done from the main directory and the executables
will be placed there as well.  The main assembler for a combination of
‘<cpu>’ and ‘<syntax>’ will be called ‘vasm<cpu>_<syntax>’.  All output
modules are usually integrated in every executable and can be selected
at runtime.

2.2 Adapting the Makefile
-------------------------

Before building anything you have to insert correct values for your
compiler and operating system in the ‘Makefile’.

‘TARGET’
     Here you may define an extension which is appended to the
     executable's name.  Useful, if you build various targets in the
     same directory.

‘TARGETEXTENSION’
     Defines the file name extension for executable files.  Not needed
     for most operating systems.  For Windows it would be ‘.exe’.

‘CC’
     Here you have to insert a command that invokes the ANSI C compiler
     you want to use to build vasm.  It must support the ‘-I’ option in
     the same way as, e.g., ‘vc’ or ‘gcc’.

‘COPTS’
     Here you will usually define an option like ‘-c’ to instruct the
     compiler to generate an object file.  Additional options, like the
     optimization level, should be inserted here as well.  When the
     host operating system is not a Unix (MacOSX and MiNT count as
     Unix), you have to define one of the following preprocessor macros:
     ‘-DAMIGA’
          AmigaOS (M68k or PPC), MorphOS, AROS.
     ‘-DATARI’
          Atari TOS.
     ‘-DMSDOS’
          CP/M, MS-DOS, Windows.

‘CCOUT’
     Here you define the option which is used to specify the name of an
     output file, which is usually ‘-o’.

‘LD’
     Here you insert a command which starts the linker.  This may be the
     same as under ‘CC’.

‘LDFLAGS’
     Here you have to add options which are necessary for linking.  E.g.
     some compilers need special libraries for floating-point.

‘LDOUT’
     Here you define the option which is used by the linker to specify
     the output file name.

‘RM’
     Specify a command to delete a file, e.g.  ‘rm -f’.

   An example for the Amiga using ‘vbcc’ would be:
           TARGET = _os3
           TARGETEXTENSION =
           CC = vc +aos68k
           CCOUT = -o
           COPTS = -c -c99 -cpu=68020 -DAMIGA -O1
           LD = $(CC)
           LDOUT = $(CCOUT)
           LDFLAGS = -lmieee
           RM = delete force quiet

   An example for a typical Unix-installation would be:
           TARGET =
           TARGETEXTENSION =
           CC = gcc
           CCOUT = -o
           COPTS = -c -O2
           LD = $(CC)
           LDOUT = $(CCOUT)
           LDFLAGS = -lm
           RM = rm -f

   Open/Net/Free/Any BSD i386 systems will probably require an
additional ‘-D_ANSI_SOURCE’ in ‘COPTS’.

2.3 Building vasm
-----------------

Note to users of Open/Free/Any BSD i386 systems: You will probably have
to use GNU make instead of BSD make, i.e.  in the following examples
replace "make" with "gmake".

   Type:
           make CPU=<cpu> SYNTAX=<syntax>
   For example:
           make CPU=ppc SYNTAX=std

   The following CPU modules can be selected:
   • ‘CPU=6502’
   • ‘CPU=6800’
   • ‘CPU=arm’
   • ‘CPU=c16x’
   • ‘CPU=jagrisc’
   • ‘CPU=m68k’
   • ‘CPU=ppc’
   • ‘CPU=test’
   • ‘CPU=tr3200’
   • ‘CPU=vidcore’
   • ‘CPU=x86’
   • ‘CPU=z80’

   The following syntax modules can be selected:
   • ‘SYNTAX=std’
   • ‘SYNTAX=mot’
   • ‘SYNTAX=madmac’
   • ‘SYNTAX=oldstyle’
   • ‘SYNTAX=test’

   For Windows and various Amiga targets there are already Makefiles
included, which you may either copy on top of the default ‘Makefile’, or
call explicitly with ‘make’'s ‘-f’ option:
         make -f Makefile.OS4 CPU=ppc SYNTAX=std

3 General data structures
=========================

This section describes the fundamental data structures used in vasm,
which you usually need to understand to write any kind of module
(cpu, syntax or output).  More detailed information is given in the
respective sections on writing specific modules where necessary.

3.1 Source
----------

A source structure represents a source text module, which can be either
the main source text, an included file or a macro.  There is always a
link to the parent source from where the current source context was
included or called.

‘struct source *parent;’
     Pointer to the parent source context.  Assembly continues there
     when the current source context ends.

‘int parent_line;’
     Line number in the parent source context, from where we were
     called.  This information is needed, because line numbers are only
     reliable during parsing and later from the atoms.  But an include
     directive doesn't create an atom.

‘char *name;’
     File name of the main source or include file, or macro name.

‘char *text;’
     Pointer to the source text start.

‘size_t size;’
     Size of the source text to assemble in bytes.

‘macro *macro;’
     Pointer to macro structure, when currently inside a macro (see also
     ‘num_params’).

‘unsigned long repeat;’
     Number of repetitions of this source text.  Usually this is 1, but
     for text blocks between a ‘rept’ and ‘endr’ directive it allows any
     number of repetitions, which is decremented every time the end of
     this source text block is reached.

‘char *irpname;’
     Name of the iterator symbol in special repeat loops which use a
     sequence of arbitrary values, being assigned to this symbol within
     the loop.  Example: ‘irp’ directive in std-syntax.

‘struct macarg *irpvals;’
     A list of arbitrary values to iterate over in a loop.  With each
     iteration the frontmost value is removed from the list until it is
     empty.

‘int cond_level;’
     Current level of conditional nesting while entering this source
     text.  It is automatically restored to the previous level when
     leaving the source prematurely through ‘end_source()’.

‘struct macarg *argnames;’
     The current list of named macro arguments.

‘int num_params;’
     Number of macro parameters passed at the invocation point from the
     parent source.  For normal source files this entry will be -1.  For
     macros 0 (no parameters) or higher.

‘char *param[MAXMACPARAMS];’
     Pointer to the macro parameters.

‘int param_len[MAXMACPARAMS];’
     Number of characters per macro parameter.

‘int num_quals;’
     (If ‘MAX_QUALIFIERS!=0’.)  Number of qualifiers for a macro.  When
     not passed on invocation these are the default qualifiers.

‘char *qual[MAX_QUALIFIERS];’
     (If ‘MAX_QUALIFIERS!=0’.)  Pointer to macro qualifiers.

‘int qual_len[MAX_QUALIFIERS];’
     (If ‘MAX_QUALIFIERS!=0’.)  Number of characters per macro
     qualifier.

‘unsigned long id;’
     Every source has its unique id.  Useful for macros supporting the
     special ‘\@’ parameter.

‘char *srcptr;’
     The current source text pointer, pointing to the beginning of the
     next line to assemble.

‘int line;’
     Line number in the current source context.  After parsing the line
     number of the current atom is stored here.

‘size_t bufsize;’
     Current size of the line buffer (‘linebuf’).  The size of the line
     buffer is extended automatically, when an overflow happens.

‘char *linebuf;’
     A buffer for the current line being assembled in this source text.
     A child-source, like a macro, can refer to arguments from this
     buffer, so every source has got its own.  When returning to the
     parent source, the linebuf is deallocated to save memory.

‘expr *cargexp;’
     (If ‘CARGSYM’ was defined.)  Pointer to the current expression
     assigned to the CARG-symbol (used to select a macro argument) in
     this source instance.  So it can be restored when reentering this
     instance.

‘long reptn;’
     (If ‘REPTNSYM’ was defined.)  Current value of the repetition
     counter symbol in this source instance.  So it can be restored when
     reentering this instance.
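
   The parent link and ‘parent_line’ together make it possible to report
where a nested source context was entered.  The following is a minimal
sketch, not vasm code: ‘struct source’ is reduced to the members used
here, ‘format_source_chain()’ is a hypothetical helper, and it assumes
that the main source has a null ‘parent’ pointer.

```c
#include <stdio.h>
#include <stddef.h>

/* Reduced stand-in for vasm's source structure.  Only the members
   described above that this sketch needs; not the real definition. */
struct source {
  struct source *parent;  /* parent context (assumed NULL for main source) */
  int parent_line;        /* line in the parent from where we were called */
  char *name;             /* file name or macro name */
  int num_params;         /* -1 for normal source files, >= 0 for macros */
  int line;               /* current line in this source context */
};

/* Hypothetical helper: write the chain of source contexts, innermost
   first, into buf.  Returns the number of contexts written. */
int format_source_chain(const struct source *src, char *buf, size_t len)
{
  int depth = 0;
  int line = src ? src->line : 0;
  size_t used = 0;

  if (len > 0)
    buf[0] = '\0';
  while (src != NULL) {
    int n = snprintf(buf + used, len - used, "%s%s \"%s\" line %d",
                     depth ? " <- " : "",
                     src->num_params >= 0 ? "macro" : "file",
                     src->name, line);
    if (n < 0 || (size_t)n >= len - used) {
      if (len > used)
        buf[used] = '\0';   /* drop the entry that did not fit */
      break;
    }
    used += (size_t)n;
    line = src->parent_line;  /* the parent is reported at the call line */
    src = src->parent;
    depth++;
  }
  return depth;
}
```

   For a macro called from an include file, this yields the macro first,
then the include file at the calling line, then the main source at the
line of the include directive.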

3.2 Sections
------------

One of the top level structures is a linked list of sections describing
continuous blocks of memory.  A section is specified by an object of
type ‘section’ with the following members that can be accessed by the
modules:

‘struct section *next;’
     A pointer to the next section in the list.

‘char *name;’
     The name of the section.

‘char *attr;’
     A string describing the section flags in ELF notation (see, for
     example, the documentation of the ‘.section’ directive of the
     standard syntax module).

‘atom *first;’
‘atom *last;’
     Pointers to the first and last atom of the section.  See following
     sections for information on atoms.

‘taddr align;’
     Alignment of the section in bytes.

‘uint32_t flags;’
     Flags of the section.  Currently available flags are:
     ‘HAS_SYMBOLS’
          At least one symbol is defined in this section.
     ‘RESOLVE_WARN’
          The current atom changed its size multiple times, so
          atom_size() is now called with this flag set in its section to
          make the backend (e.g.  ‘instruction_size()’) aware of it and
          do less aggressive optimizations.
     ‘UNALLOCATED’
          Section is unallocated, which means it doesn't use any memory
          space in the output file.  Such a section will be removed
          before creating the output file and all its labels converted
          into absolute expression symbols.  Used for "offset" sections.
          Refer to ‘switch_offset_section()’.
     ‘LABELS_ARE_LOCAL’
          As long as this flag is set new labels in a section are
          defined as local labels, with the section name as global
          parent label.
     ‘ABSOLUTE’
          Section is loaded at an absolute address in memory.
     ‘PREVABS’
          Remembers state of the ‘ABSOLUTE’ flag before entering
          relocated-org mode (‘IN_RORG’).  So it can be restored later.
     ‘IN_RORG’
          Section has entered relocated-org mode, which also sets the
          ‘ABSOLUTE’ flag.  In this mode code is written into the
          current section, but relocated to an absolute address.  No
          relocation information is generated.
     ‘NEAR_ADDRESSING’
          Section is marked as suitable for cpu-specific "near"
          addressing modes.  For example, base-register relative.  The
          cpu backend can use this information as an optimization hint
          when referencing symbols from this section.

‘taddr org;’
     Start address of a section.  Usually zero.

‘taddr pc;’
     Current address in this section.  Can be used while traversing
     through the section.  Has to be updated by a module using it.  Is
     set to ‘org’ at the beginning.

‘unsigned long idx;’
     A member usable by the output module for private purposes.
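
   As an illustration of how a module might walk the section list, the
following sketch resets ‘pc’ to ‘org’ and stores a running index in
‘idx’.  It is not taken from vasm: the structure is reduced to a few
members, ‘taddr’ is assumed to be ‘long’, the numeric value given to
‘UNALLOCATED’ is a placeholder, and ‘number_sections()’ is a
hypothetical helper.

```c
#include <stdint.h>
#include <stddef.h>

typedef long taddr;  /* assumption: the real type comes from the cpu module */

/* Placeholder value for this sketch only.  Real code must use the
   symbolic constant from the vasm headers. */
#define UNALLOCATED (1u << 2)

/* Reduced stand-in for the section structure described above. */
typedef struct section {
  struct section *next;
  char *name;
  uint32_t flags;
  taddr org;
  taddr pc;
  unsigned long idx;
} section;

/* Hypothetical helper: reset pc to org in every section and give each
   section that occupies space a consecutive index in idx.  Unallocated
   sections are skipped, in case the list is walked before they have
   been removed.  Returns the number of indexed sections. */
unsigned long number_sections(section *first)
{
  unsigned long n = 0;
  section *sec;

  for (sec = first; sec != NULL; sec = sec->next) {
    sec->pc = sec->org;          /* pc has to be maintained by the user */
    if (sec->flags & UNALLOCATED)
      continue;
    sec->idx = n++;
  }
  return n;
}
```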

3.3 Symbols
-----------

Symbols are represented by a linked list of type ‘symbol’ with the
following members that can be accessed by the modules:

‘int type;’
     Type of the symbol.  Available are:
     ‘#define LABSYM 1’
          The symbol is a label defined at a specific location.

     ‘#define IMPORT 2’
          The symbol is imported from another file.

     ‘#define EXPRESSION 3’
          The symbol is defined using an expression.

‘uint32_t flags;’
     Flags of this symbol.  Available are:
     ‘#define TYPE_UNKNOWN 0’
          The symbol has no type information.

     ‘#define TYPE_OBJECT 1’
          The symbol defines an object.

     ‘#define TYPE_FUNCTION 2’
          The symbol defines a function.

     ‘#define TYPE_SECTION 3’
          The symbol defines a section.

     ‘#define TYPE_FILE 4’
          The symbol defines a file.

     ‘#define EXPORT (1<<3)’
          The symbol is exported to other files.

     ‘#define INEVAL (1<<4)’
          Used internally.

     ‘#define COMMON (1<<5)’
          The symbol is a common symbol.

     ‘#define WEAK (1<<6)’
          The symbol is weak, which means the linker may overwrite it
          with any global definition of the same name.  Weak symbols may
          also stay undefined, in which case the linker would assign
          them a value of zero.

     ‘#define LOCAL (1<<7)’
          Only informational.  A symbol can be explicitly declared as
          local by a syntax-module directive.

     ‘#define VASMINTERN (1<<8)’
          Vasm-internal symbol, which is usually not exported into an
          output file.

     ‘#define PROTECTED (1<<9)’
          Used internally to protect the current-PC symbol from
          deletion.

     ‘#define REFERENCED (1<<10)’
          Symbol was referenced in the source and a relocation entry has
          been created.

     ‘#define ABSLABEL (1<<11)’
          Label was defined inside an absolute section, or during
          relocated-org mode.  So it has an absolute address and will
          not generate a relocation entry when being referenced.

     ‘#define EQUATE (1<<12)’
          Symbols flagged as ‘EQUATE’ are constant and their values must
          not be changed.

     ‘#define REGLIST (1<<13)’
          Symbol is a register list definition.

     ‘#define USED (1<<14)’
          Symbol appeared in an expression.  Symbols which were only
          defined (as label or equate) and never used throughout the
          whole source don't get this flag set.

     ‘#define NEAR (1<<15)’
          Symbol may be referenced by "near" addressing mode.  For
          example, base register relative.  Used as an optimization hint
          in the cpu backend.

     ‘#define RSRVD_S (1L<<24)’
          The range from bit 24 to 27 (counted from the LSB) is reserved
          for use by the syntax module.

     ‘#define RSRVD_O (1L<<28)’
          The range from bit 28 to 31 (counted from the LSB) is reserved
          for use by the output module.

     The type-flags can be extracted using the ‘TYPE()’ macro which
     expects a pointer to a symbol as argument.

‘char *name;’
     The name of the symbol.

‘expr *expr;’
     The expression in case of ‘EXPRESSION’ symbols.

‘expr *size;’
     The size of the symbol, if specified.

‘section *sec;’
     The section a ‘LABSYM’ symbol is defined in.

‘taddr pc;’
     The address of a ‘LABSYM’ symbol.

‘taddr align;’
     The alignment of the symbol in bytes.

‘unsigned long idx;’
     A member usable by the output module for private purposes.
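
   The ‘flags’ member combines a small type number in its low bits with
individual flag bits above it.  The sketch below shows how the two can
be tested together.  It is an illustration only: the constants are
copied from the listing above (and may be out of date), the mask of 7
and the ‘SKETCH_TYPE()’ macro are assumptions standing in for the real
‘TYPE()’ macro, the ‘next’ link is assumed, and
‘is_exported_function()’ is a hypothetical helper.

```c
#include <stdint.h>
#include <stddef.h>

/* Values copied from the listing above so that the sketch compiles on
   its own.  Real code must include the vasm headers instead. */
#define LABSYM 1
#define IMPORT 2
#define TYPE_OBJECT 1
#define TYPE_FUNCTION 2
#define EXPORT (1<<3)
#define WEAK (1<<6)

/* Assumption: the type number occupies the bits below EXPORT, so a
   mask of 7 isolates it.  Stand-in for the real TYPE() macro. */
#define SKETCH_TYPE_MASK 7
#define SKETCH_TYPE(sym) ((sym)->flags & SKETCH_TYPE_MASK)

/* Reduced stand-in for the symbol structure described above. */
typedef struct symbol {
  struct symbol *next;   /* assumed list link */
  int type;
  uint32_t flags;
  char *name;
} symbol;

/* Hypothetical predicate: a label typed as a function that is exported
   and not weak. */
int is_exported_function(const symbol *sym)
{
  return sym->type == LABSYM
      && SKETCH_TYPE(sym) == TYPE_FUNCTION
      && (sym->flags & EXPORT) != 0
      && (sym->flags & WEAK) == 0;
}
```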

3.4 Register symbols
--------------------

Optional register symbols are available when the backend defines
‘HAVE_REGSYMS’ in ‘cpu.h’ together with the hash table size.  Example:
     #define HAVE_REGSYMS
     #define REGSYMHTSIZE 256

   A register symbol is defined by an object of type ‘regsym’ with the
following members that can be accessed by the modules:

‘char *reg_name;’
     Symbol name.
‘int reg_type;’
     Optional type of register.
‘unsigned int reg_flags;’
     Optional register symbol flags.
‘unsigned int reg_num;’
     Register number or value.

   Refer to ‘symbol.h’ for functions to create and find register
symbols.

3.5 Atoms
---------

The contents of each section are a linked list built out of
non-separable atoms.  The general structure of an atom is:

     typedef struct atom {
       struct atom *next;
       int type;
       taddr align;
       taddr lastsize;
       unsigned changes;
       source *src;
       int line;
       listing *list;
       union {
         instruction *inst;
         dblock *db;
         symbol *label;
         sblock *sb;
         defblock *defb;
         void *opts;
         int srcline;
         char *ptext;
         printexpr *pexpr;
         expr *roffs;
         taddr *rorg;
         assertion *assert;
         aoutnlist *nlist;
       } content;
     } atom;

   The members have the following meaning:

‘struct atom *next;’
     Pointer to the following atom (0 if last).

‘int type;’
     The type of the atom.  Can be one of
     ‘#define LABEL 1’
          A label is defined here.

     ‘#define DATA 2’
          Some data bytes of fixed length and constant data are put
          here.

     ‘#define INSTRUCTION 3’
          Generally refers to a machine instruction or pseudo/opcode.
          These atoms can change length during optimization passes and
          will be translated to ‘DATA’-atoms later.

     ‘#define SPACE 4’
          Defines a block of data filled with one value (byte).  BSS
          sections usually contain only such atoms, but they are also
          sometimes useful as shorter versions of ‘DATA’-atoms in other
          sections.

     ‘#define DATADEF 5’
          Defines data of fixed size which can contain cpu specific
          operands and expressions.  Will be translated to ‘DATA’-atoms
          later.

     ‘#define LINE 6’
          A source text line number (usually from a high level language)
          is bound to the atom's address.  Useful for source level
          debugging in certain ABIs.

     ‘#define OPTS 7’
          A means to change assembler options at a specific source text
          line.  For example optimization settings, or the cpu type to
          generate code for.  The cpu module has to define
          ‘HAVE_CPU_OPTS’ and export the required functions if it wants
          to use this type of atom.

     ‘#define PRINTTEXT 8’
          A string is printed to stdout during the final assembler pass.
          A newline is automatically appended.

     ‘#define PRINTEXPR 9’
          Prints the value of an expression during the final assembler
          pass to stdout.

     ‘#define ROFFS 10’
          Set the program counter to an address relative to the
          section's start address.  These atoms will be translated into
          ‘SPACE’ atoms in the final pass.

     ‘#define RORG 11’
          Assemble this block under the given base address, while the
          code is still written into the original memory region.

     ‘#define RORGEND 12’
          Ends a RORG block and returns to the original addressing.

     ‘#define ASSERT 13’
          The assertion expression is checked in the final pass and an
          error message is generated (using the expression string and an
          optional message out of this atom) when it evaluates to 0.

     ‘#define NLIST 14’
          Defines a stab-entry for the a.out object file format.
          nlist-style stabs can also occur embedded in other object file
          formats, like ELF.

‘taddr align;’
     The alignment of this atom.  Address must be divisible by ‘align’.

‘taddr lastsize;’
     The size of this atom in the last resolver pass.  When the size has
     changed in the current pass, the assembler will request another
     resolver run through the section.

‘unsigned changes;’
     Number of changes in the size of this atom since pass number
     ‘FASTOPTPHASE’.  An increasing number usually indicates a problem
     in the cpu backend's optimizer and will be flagged by setting
     ‘RESOLVE_WARN’ in the section flags, as soon as ‘changes’ exceeds
     ‘MAXSIZECHANGES’.  So the backend can choose not to optimize this
     atom as aggressively as before.

‘source *src;’
     Pointer to the source text object to which this atom belongs.

‘int line;’
     The source line number that created this atom.

‘listing *list;’
     Pointer to the listing object to which this atom belongs.

‘instruction *inst;’
     (In union ‘content’.)  Pointer to an instruction structure in the
     case of an ‘INSTRUCTION’-atom.  Contains the following elements:
     ‘int code;’
          The cpu specific code of this instruction.

     ‘char *qualifiers[MAX_QUALIFIERS];’
          (If ‘MAX_QUALIFIERS!=0’.)  Pointer to the qualifiers of this
          instruction.

     ‘operand *op[MAX_OPERANDS];’
          (If ‘MAX_OPERANDS!=0’.)  The cpu-specific operands of this
          instruction.

     ‘instruction_ext ext;’
          (If the cpu module defines ‘HAVE_INSTRUCTION_EXTENSION’.)  A
          cpu-module-specific structure.  Typically used to store
          appropriate opcodes, allowed addressing modes, supported cpu
          derivatives etc.

‘dblock *db;’
     (In union ‘content’.)  Pointer to a dblock structure in the case of
     a ‘DATA’-atom.  Contains the following elements:
     ‘taddr size;’
          The number of bytes stored in this atom.

     ‘char *data;’
          A pointer to the data.

     ‘rlist *relocs;’
          A pointer to relocation information for the data.

‘symbol *label;’
     (In union ‘content’.)  Pointer to a symbol structure in the case of
     a ‘LABEL’-atom.

‘sblock *sb;’
     (In union ‘content’.)  Pointer to a sblock structure in the case of
     a ‘SPACE’-atom.  Contains the following elements:
     ‘taddr space;’
          The size of the empty/filled space in bytes.

     ‘expr *space_exp;’
          The above size as an expression, which will be evaluated
          during assembly and copied to ‘space’ in the final pass.

     ‘int size;’
          The size of each space-element and of the fill-pattern in
          bytes.

     ‘unsigned char fill[MAXBYTES];’
          The fill pattern, up to MAXBYTES bytes.

     ‘expr *fill_exp;’
          Optional.  Evaluated and copied to ‘fill’ in the final pass,
          when not null.

     ‘rlist *relocs;’
          A pointer to relocation information for the space.

     ‘taddr maxalignbytes;’
          An optional number of maximum padding bytes to fulfil the
          atom's alignment requirement.  Zero means there is no
          restriction.

‘defblock *defb;’
     (In union ‘content’.)  Pointer to a defblock structure in the case
     of a ‘DATADEF’-atom.  Contains the following elements:
     ‘taddr bitsize;’
          The size of the definition in bits.

     ‘operand *op;’
          Pointer to a cpu-specific operand structure.

‘void *opts;’
     (In union ‘content’.)  Points to a cpu module specific options
     object in the case of an ‘OPTS’-atom.

‘int srcline;’
     (In union ‘content’.)  Line number for source level debugging in
     the case of a ‘LINE’-atom.

‘char *ptext;’
     (In union ‘content’.)  A string to print to stdout in case of a
     ‘PRINTTEXT’-atom.

‘printexpr *pexpr;’
     (In union ‘content’.)  Pointer to a printexpr structure in the case
     of a ‘PRINTEXPR’-atom.  Contains the following elements:
     ‘expr *print_exp;’
          Pointer to an expression to evaluate and print.

     ‘short type;’
          Format type of the printed value.  We can print as hexadecimal
          (‘PEXP_HEX’), signed decimal (‘PEXP_SDEC’), unsigned decimal
          (‘PEXP_UDEC’), binary (‘PEXP_BIN’) or ASCII (‘PEXP_ASC’).

     ‘short size;’
          Size (precision) of the printed value in bits.  Excessive bits
          will be masked out, and sign-extended when requested.

‘expr *roffs;’
     (In union ‘content’.)  The expression holds the relative section
     offset to align to in case of a ‘ROFFS’-atom.

‘taddr *rorg;’
     (In union ‘content’.)  Assemble the code under the base address in
     ‘rorg’ in case of a ‘RORG’-atom.

‘assertion *assert;’
     (In union ‘content’.)  Pointer to an assertion structure in the
     case of an ‘ASSERT’-atom.  Contains the following elements:
     ‘expr *assert_exp;’
          Pointer to an expression which should evaluate to non-zero.

     ‘char *exprstr;’
          Pointer to the expression as text (to be used in the output).

     ‘char *msgstr;’
          Pointer to the message, which would be printed when
          ‘assert_exp’ evaluates to zero.

‘aoutnlist *nlist;’
     (In union ‘content’.)  Pointer to an nlist structure, describing an
     aout stab entry, in case of an ‘NLIST’-atom.  Contains the
     following elements:
     ‘char *name;’
          Name of the stab symbol.
     ‘int type;’
          Symbol type.  Refer to ‘stabs.h’ for definitions.
     ‘int other;’
          Defines the nature of the symbol (function, object, etc.).
     ‘int desc;’
          Debugger information.
     ‘expr *value;’
          Symbol's value.
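
   To illustrate how ‘align’ and the atom sizes determine addresses
within a section, the following sketch lays out an atom list starting at
a given address.  It is not vasm's resolver: the atom structure is
reduced to four members, ‘taddr’ is assumed to be ‘long’, the stored
‘lastsize’ is used in place of a real size calculation, and both
functions are hypothetical helpers.

```c
#include <stddef.h>

typedef long taddr;  /* assumption: the real type comes from the cpu module */

/* Reduced stand-in for the atom structure shown above. */
typedef struct atom {
  struct atom *next;
  int type;
  taddr align;
  taddr lastsize;
} atom;

/* Round a non-negative pc up to the next multiple of align.
   An alignment of 0 or 1 leaves pc unchanged. */
taddr align_up(taddr pc, taddr align)
{
  taddr rem;

  if (align <= 1)
    return pc;
  rem = pc % align;
  return rem ? pc + (align - rem) : pc;
}

/* Hypothetical helper: return the address following the last atom when
   the list is laid out from address org.  Each atom is first aligned,
   then its size from the last pass is added. */
taddr layout_end(const atom *first, taddr org)
{
  taddr pc = org;
  const atom *a;

  for (a = first; a != NULL; a = a->next) {
    pc = align_up(pc, a->align);
    pc += a->lastsize;
  }
  return pc;
}
```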

3.6 Relocations
---------------

‘DATA’ and ‘SPACE’ atoms can have a relocation list attached that
describes how this data must be modified when linking/relocating.  They
always refer to the data in this atom only.

   There are a number of predefined standard relocations and it is
possible to add other cpu-specific relocations.  Note however, that it
is always preferable to use standard relocations, if possible.  Chances
that an output module supports a certain relocation are much higher if
it is a standard relocation.

   A relocation list uses this structure:

     typedef struct rlist {
       struct rlist *next;
       void *reloc;
       int type;
     } rlist;

   Type identifies the relocation type.  All the standard relocations
have type numbers between ‘FIRST_STANDARD_RELOC’ and
‘LAST_STANDARD_RELOC’.  Consult ‘reloc.h’ to see which standard
relocations are available.

   The detailed information can be accessed via the pointer ‘reloc’.  It
will point to a structure that depends on the relocation type, so a
module must only use it if it knows the relocation type.

   All standard relocations point to a type ‘nreloc’ with the following
members:
‘size_t byteoffset;’
     Offset in bytes, from the start of the current ‘DATA’ atom, to the
     beginning of the relocation field.  This may also be the address
     which is used as a basis for PC-relative relocations.  Or a common
     basis for several separated relocation fields, which will be
     translated into a single relocation type by the output module.

‘size_t bitoffset;’
     Offset in bits to the beginning of the relocation field, adds to
     ‘byteoffset*bitsperbyte’.  Bits are counted in a bit-stream from
     lower to higher address bytes.  But note, that inside a
     little-endian byte they are counted from the LSB to the MSB, while
     they are counted from the MSB to the LSB for big-endian targets.

‘int size;’
     The size of the relocation field in bits.

‘taddr mask;’
     The mask defines which portion of the relocated value is set by
     this relocation field.

‘taddr addend;’
     Value to be added to the symbol value.

‘symbol *sym;’
     The symbol referred to by this relocation.

   To describe the meaning of these entries, we will define the steps
that shall be executed when performing a relocation:

  1. Extract the ‘size’ bits from the data atom, starting with bit
     number ‘byteoffset*bitsperbyte+bitoffset’.  We start counting bits
     from the lowest to the highest numbered byte in memory.  Inside a
     big-endian byte we count from the MSB to the LSB. Inside a
     little-endian byte we count from the LSB to the MSB.

  2. Determine the relocation value of the symbol.  For a simple
     absolute relocation, this will be the value of the symbol ‘sym’
     plus the ‘addend’.  For other relocation types, more complex
     calculations will be needed.  For example, in a program-counter
     relative relocation, the value will be obtained by subtracting the
     address of the data atom plus ‘byteoffset’ from the value of ‘sym’
     plus ‘addend’.

  3. Calculate the bit-wise "and" of the value obtained in the step
     above and the ‘mask’ value.

  4. Normalize, i.e.  shift the value above right as many bit positions
     as there are low order zero bits in ‘mask’.

  5. Add this value to the value extracted in step 1.

  6. Insert the low order ‘size’ bits of this value into the data atom
     starting with bit ‘byteoffset*bitsperbyte+bitoffset’.
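
   The six steps above can be sketched in C for the simplest case, an
absolute relocation on a big-endian target with 8-bit bytes.  This is an
illustration and not code from vasm: the function names are
hypothetical, the ‘nreloc’ members are passed as plain arguments, the
symbol value is assumed to be already known, and fields are limited to
64 bits.  PC-relative and little-endian cases are not covered.

```c
#include <stddef.h>
#include <stdint.h>

#define BITSPERBYTE 8  /* assumption: 8-bit target bytes */

/* Read 'size' bits starting at stream bit 'pos'.  Inside each byte the
   bits are counted from the MSB to the LSB (big-endian rule). */
static uint64_t get_bits_be(const unsigned char *data, size_t pos, int size)
{
  uint64_t v = 0;
  int i;

  for (i = 0; i < size; i++, pos++) {
    unsigned shift = (unsigned)(BITSPERBYTE - 1 - pos % BITSPERBYTE);
    v = (v << 1) | ((data[pos / BITSPERBYTE] >> shift) & 1u);
  }
  return v;
}

/* Write the low 'size' bits of v at stream bit 'pos', same bit order. */
static void put_bits_be(unsigned char *data, size_t pos, int size, uint64_t v)
{
  int i;

  for (i = size - 1; i >= 0; i--, pos++) {
    unsigned shift = (unsigned)(BITSPERBYTE - 1 - pos % BITSPERBYTE);
    unsigned char m = (unsigned char)(1u << shift);
    if ((v >> i) & 1u)
      data[pos / BITSPERBYTE] |= m;
    else
      data[pos / BITSPERBYTE] &= (unsigned char)~m;
  }
}

/* Hypothetical helper performing steps 1 to 6 for an absolute
   relocation against a symbol with the value symval. */
void apply_abs_reloc(unsigned char *data, size_t byteoffset, size_t bitoffset,
                     int size, uint64_t mask, uint64_t symval, int64_t addend)
{
  size_t pos = byteoffset * BITSPERBYTE + bitoffset;
  uint64_t field = get_bits_be(data, pos, size);   /* step 1 */
  uint64_t val = symval + (uint64_t)addend;        /* step 2 */

  val &= mask;                                     /* step 3 */
  while (mask != 0 && (mask & 1u) == 0) {          /* step 4 */
    mask >>= 1;
    val >>= 1;
  }
  field += val;                                    /* step 5 */
  put_bits_be(data, pos, size, field);             /* step 6 */
}
```

   With a ‘mask’ of 0xffff0000 and a ‘size’ of 16, for example, the
upper half of a 32-bit symbol value ends up in a 16-bit field, because
step 4 shifts out the sixteen low-order zero bits of the mask.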

3.7 Errors
----------

Each module can provide a list of possible error messages contained e.g.
in ‘syntax_errors.h’ or ‘cpu_errors.h’.  They are a comma-separated list
of a printf-format string and error flags.  Allowed flags are ‘WARNING’,
‘ERROR’, ‘FATAL’, ‘MESSAGE’ and ‘NOLINE’.  They can be combined using or
(‘|’).  ‘NOLINE’ has to be set for error messages during initialization
or while writing the output, when no source text is available.  Errors
cause the assembler to return false.  ‘FATAL’ causes the assembler to
terminate immediately.

   The errors can be emitted using the function ‘syntax_error(int
n,...)’, ‘cpu_error(int n,...)’ or ‘output_error(int n,...)’.  The first
argument is the number of the error message (starting from zero).
Additional arguments must be passed according to the format string of
the corresponding error message.
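
   The following sketch illustrates the idea of such an error list and
of a function that selects a message by number.  It is not vasm's
implementation: the flag values are placeholders, the table layout
(string and flags pairs filling an array of structures through brace
elision) is an assumption about how such a file could be included, and
‘sketch_error()’ is a hypothetical stand-in that formats into a buffer
instead of printing.

```c
#include <stdarg.h>
#include <stddef.h>
#include <stdio.h>

/* Placeholder flag values for this sketch only.  Real code must use
   the symbolic constants from the vasm headers. */
#define WARNING 1
#define ERROR   2
#define FATAL   4
#define MESSAGE 8
#define NOLINE  16

struct err_entry {
  const char *text;  /* printf-format string */
  int flags;         /* combination of the flags above */
};

/* In a real module these comma-separated pairs would live in a file
   like syntax_errors.h and be included between the braces (assumed). */
static const struct err_entry sketch_errors[] = {
  "invalid extension", ERROR,                          /* 0 */
  "unknown directive <%s>", WARNING,                   /* 1 */
  "cannot open <%s> for output", FATAL|ERROR|NOLINE,   /* 2 */
};

/* Hypothetical stand-in for syntax_error(): format message number n
   into buf and return its flags, or -1 when n is out of range. */
int sketch_error(char *buf, size_t len, int n, ...)
{
  va_list ap;
  int count = (int)(sizeof sketch_errors / sizeof sketch_errors[0]);

  if (n < 0 || n >= count)
    return -1;
  va_start(ap, n);
  vsnprintf(buf, len, sketch_errors[n].text, ap);
  va_end(ap);
  return sketch_errors[n].flags;
}
```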

4 Syntax modules
================

A new syntax module must have its own subdirectory under ‘vasm/syntax’.
At least the files ‘syntax.h’, ‘syntax.c’ and ‘syntax_errors.h’ must be
written.

4.1 The file ‘syntax.h’
-----------------------

‘#define ISIDSTART(x)/ISIDCHAR(x)’
     These macros should return non-zero if and only if the argument is
     a valid character to start an identifier or a valid character
     inside an identifier, respectively.  ‘ISIDCHAR’ must be a superset
     of ‘ISIDSTART’.

‘#define ISBADID(p,l)’
     Even with ‘ISIDSTART’ and ‘ISIDCHAR’ checked, there may be
     combinations of characters which do not form a valid identifier
     (for example, a single character).  This macro returns non-zero,
     when this is the case.  First argument is a pointer to the new
     identifier and second is its length.

‘#define ISEOL(x)’
     This macro returns true when the string pointed to by ‘x’ starts
     with either a comment character or end-of-line.

‘#define CHKIDEND(s,e) chkidend((s),(e))’
     Defines an optional function to be called at the end of the
     identifier recognition process.  It allows you to adjust the length
     of the identifier by returning a modified ‘e’.  Default is to
     return ‘e’.  The function is defined as ‘char *chkidend(char
     *startpos,char *endpos)’.

‘#define BOOLEAN(x) -(x)’
     Defines the result of boolean operations.  Usually this is ‘(x)’,
     as in C, or ‘-(x)’ to return -1 for True.

‘#define NARGSYM "NARG"’
     Defines the name of an optional symbol which contains the number of
     arguments in a macro.

‘#define CARGSYM "CARG"’
     Defines the name of an optional symbol which can be used to select
     a specific macro argument with ‘\.’, ‘\+’ and ‘\-’.

‘#define REPTNSYM "REPTN"’
     Defines the name of an optional symbol containing the counter of
     the current repeat iteration.

‘#define EXPSKIP() s=exp_skip(s)’
     Defines an optional replacement for skip() to be used in expr.c, to
     skip blanks in an expression.  Useful to forbid blanks in an
     expression and to ignore the rest of the line (e.g.  to treat the
     rest as comment).  The function is defined as ‘char *exp_skip(char
     *stream)’.

‘#define IGNORE_FIRST_EXTRA_OP 1’
     Should be defined when the syntax module wants to ignore the
     operand field on instructions without an operand.  Useful, when
     everything following an operand should be regarded as comment,
     without a comment character.

‘#define MAXMACPARAMS 35’
     Optionally defines the maximum number of macro arguments, if you
     need more than the default number of 9.

‘#define SKIP_MACRO_ARGNAME(p) skip_identifier(p)’
     An optional function to skip a named macro argument in the macro
     definition.  Argument is the current source stream pointer.  The
     default is to skip an identifier.

‘#define MACRO_ARG_OPTS(m,n,a,p) NULL’
     An optional function to parse and skip options, default values and
     qualifiers for each macro argument.  Returns ‘NULL’ when no
     argument options have been found.  Arguments are:
     ‘struct macro *m;’
          Pointer to the macro structure being currently defined.
     ‘int n;’
          Argument index, starting with zero.
     ‘char *a;’
          Name of this argument.
     ‘char *p;’
          Current source stream pointer.  An updated pointer will be
          returned.
     Defaults to unused.

‘#define MACRO_ARG_SEP(p) (*p==',' ? skip(p+1) : NULL)’
     An optional function to skip a separator between the macro argument
     names in the macro definition.  Returns NULL when no valid
     separator is found.  Argument is the current source stream pointer.
     Defaults to using comma as the only valid separator.

‘#define MACRO_PARAM_SEP(p) (*p==',' ? skip(p+1) : NULL)’
     An optional function to skip a separator between the macro
     parameters in a macro call.  Returns NULL when no valid separator
     is found.  Argument is the current source stream pointer.  Defaults
     to using comma as the only valid separator.

‘#define EXEC_MACRO(s)’
     An optional function to be called just before a macro starts
     execution.  Parameters and qualifiers are already parsed.  Argument
     is the ‘source’ pointer of the new macro.  Defaults to unused.
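
   To illustrate how some of these macros might look in practice, here
is a small sketch.  It is not taken from an existing syntax module: the
‘skip()’ shown is a simplified stand-in for the real one, and the rules
implemented by ‘chkidend()’ and ‘ISBADID’ are invented for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Simplified stand-in for the syntax module's skip(): skips blanks only. */
static char *skip(char *s)
{
    assert(s != NULL);
    while (isspace((unsigned char)*s))
        s++;
    return s;
}

/* Hypothetical rule: a trailing ':' does not belong to the identifier. */
static char *chkidend(char *startpos, char *endpos)
{
    if (endpos > startpos && endpos[-1] == ':')
        endpos--;
    return endpos;
}

/* True is returned as -1, False as 0. */
#define BOOLEAN(x) -(x)

/* Hypothetical rule: a lone '.' is not a valid identifier. */
#define ISBADID(p,l) ((l) == 1 && *(p) == '.')

#define CHKIDEND(s,e) chkidend((s),(e))

/* Comma is the only valid separator; blanks after it are skipped. */
#define MACRO_PARAM_SEP(p) (*(p) == ',' ? skip((p) + 1) : NULL)
```

   With these definitions, ‘MACRO_PARAM_SEP’ applied to the comma in
‘a,  b’ yields a pointer to ‘b’, and ‘BOOLEAN(1==1)’ evaluates to -1.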

4.2 The file ‘syntax.c’
-----------------------

A syntax module has to provide the following elements (all other
functions should be ‘static’ to prevent name clashes):

‘char *syntax_copyright;’
     A string that will be emitted as part of the copyright message.

‘hashtable *dirhash;’
     A pointer to the hash table with all directives.

‘char commentchar;’
     A character used to introduce a comment until the end of the line.

‘char *defsectname;’
     Name of a default section which vasm creates when a label or code
     occurs in the source, but the programmer forgot to specify a
     section.  Assigning NULL means that there is no default and vasm
     will show an error in this case.

‘char *defsecttype;’
     Type of the default section (see above).  May be NULL.

‘int init_syntax();’
     Will be called during startup, after argument parsing.  Must return
     zero if initializations failed, non-zero otherwise.

‘int syntax_args(char *);’
     This function will be called with the command line arguments
     (unless they were already recognized by other modules).  If an
     argument was recognized, return non-zero.

‘char *skip(char *);’
     A function to skip whitespace etc.

‘char *skip_operand(char *);’
     A function to skip an instruction's operand.  Will terminate at end
     of line or the next comma, returning a pointer to the rest of the
     line behind the comma.

‘void eol(char *);’
     This function should check that the argument points to the end of a
     line (only comments or whitespace following).  If not, an error or
     warning message should be emitted.

‘char *const_prefix(char *,int *);’
     Check if the first argument points to the start of a constant.  If
     yes return a pointer to the real start of the number (i.e.  skip a
     prefix that may indicate the base) and write the base of the number
     through the pointer passed as second argument.  Return zero if it
     does not point to a number.

‘char *const_suffix(char *,char *);’
     First argument points to the start of the constant (including
     prefix) and the second argument to first character after the
     constant (excluding suffix).  Checks for a constant-suffix and
     skips it.  Return pointer to the first character after that
     constant.  Example: constants with a 'h' suffix to indicate a
     hexadecimal base.

‘void parse(void);’
     This is the main parsing function.  It has to read lines via the
     ‘read_next_line()’ function, parse them and create sections, atoms
     and symbols.  Pseudo directives are usually handled by the syntax
     module.  Instructions can be parsed by the cpu module using
     ‘parse_instruction()’.

‘char *parse_macro_arg(struct macro *,char *,struct namelen *,struct namelen *);’
     Called to parse a macro parameter by using the source stream
     pointer in the second argument.  The start pointer and length of a
     single passed parameter is written to the first ‘struct namelen’,
     while the optionally selected named macro argument is passed in the
     second ‘struct namelen’.  When the ‘len’ field of the second
     ‘namelen’ is zero, then the argument is selected by position
     instead of by name.  Returns the updated source stream pointer after
     successful parsing.

‘int expand_macro(source *,char **,char *,int);’
     Expand parameters and special commands inside a macro source.  The
     second argument is a pointer to the current source stream pointer,
     which is updated on any successful expansion.  The function will
     return the number of characters written to the destination buffer
     (third argument) in this case.  Returning ‘-1’ means: no expansion
     took place.  The last argument defines the space in characters
     which is left in the destination buffer.

‘char *get_local_label(char **);’
     Gets a pointer to the current source pointer.  Has to check if a
     valid local label is found at this point.  If yes return a pointer
     to the vasm-internal symbol name representing the local label and
     update the current source pointer to point behind the label.

     Have a look at the support functions provided by the frontend to
     help.
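
   As an example of one of these functions, a ‘const_prefix()’ could be
sketched as follows.  The prefixes recognized here (‘$’ for hexadecimal
and ‘%’ for binary) are assumptions made for this illustration and do
not describe any particular syntax module.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Returns a pointer to the first digit of the constant and stores its
 * base through 'base'.  Returns NULL when 's' is not the start of a
 * constant.  The '$' and '%' prefixes are assumptions of this sketch.
 */
char *const_prefix(char *s, int *base)
{
    assert(s != NULL && base != NULL);

    if (isdigit((unsigned char)*s)) {
        *base = 10;              /* no prefix: decimal */
        return s;
    }
    if (*s == '$' && isxdigit((unsigned char)s[1])) {
        *base = 16;              /* $1F */
        return s + 1;
    }
    if (*s == '%' && (s[1] == '0' || s[1] == '1')) {
        *base = 2;               /* %101 */
        return s + 1;
    }
    return NULL;                 /* not a number, 'base' is untouched */
}
```

   Requiring a valid digit after the prefix keeps a lone ‘$’ or ‘%’
available for other uses, such as an operator or the current pc symbol.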

5 CPU modules
=============

A new cpu module must have its own subdirectory under ‘vasm/cpus’.  At
least the files ‘cpu.h’, ‘cpu.c’ and ‘cpu_errors.h’ must be written.

5.1 The file ‘cpu.h’
--------------------

A cpu module has to provide the following elements (all other functions
should be ‘static’ to prevent name clashes) in ‘cpu.h’:

‘#define MAX_OPERANDS 3’
     Maximum number of operands of one instruction.

‘#define MAX_QUALIFIERS 0’
     Maximum number of mnemonic-qualifiers per mnemonic.

‘#define NO_MACRO_QUALIFIERS’
     Define this, when qualifiers shouldn't be allowed for macros.  For
     some architectures, like ARM, macro qualifiers make no sense.

‘typedef int32_t taddr;’
     Data type to represent a target-address.  Preferably use the ones
     from ‘stdint.h’.

‘typedef uint32_t utaddr;’
     Unsigned data type to represent a target-address.

‘#define LITTLEENDIAN 1’
‘#define BIGENDIAN 0’
     Define these according to the target endianness.  For CPUs which
     support big- and little-endian, you may assign a global variable
     here.  So be aware of it, and never use ‘#if BIGENDIAN’, but always
     ‘if(BIGENDIAN)’ in your code.

‘#define VASM_CPU_<cpu> 1’
     Insert the cpu specifier.

‘#define INST_ALIGN 2’
     Minimum instruction alignment.

‘#define DATA_ALIGN(n) ...’
     Default alignment for ‘n’-bit data.  Can also be a function.

‘#define DATA_OPERAND(n) ...’
     Operand class for n-bit data definitions.  Can also be a function.
     Negative values denote a floating point data definition of -n bits.

‘typedef ... operand;’
     Structure to store an operand.

‘typedef ... mnemonic_extension;’
     Mnemonic extension.
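
   The required definitions might be put together as in the following
sketch for an imaginary bi-endian CPU.  All concrete values, the
variable ‘mycpu_be’, the constant ‘OP_DATA’, the helper ‘put16()’ and
the layout of ‘operand’ and ‘mnemonic_extension’ are invented for the
example.  It also shows why ‘if(BIGENDIAN)’ has to be used instead of
‘#if BIGENDIAN’: the macro may expand to a variable.

```c
#include <stdint.h>

#define MAX_OPERANDS 2
#define MAX_QUALIFIERS 0

typedef int32_t taddr;
typedef uint32_t utaddr;

/* Endianness is selected at run time (hypothetical variable).  In a
   real header this would be an extern declaration. */
static int mycpu_be = 1;
#define BIGENDIAN (mycpu_be)
#define LITTLEENDIAN (!mycpu_be)

#define VASM_CPU_MYCPU 1
#define INST_ALIGN 2

/* 8-bit data is unaligned, 16-bit data needs 2, everything else 4. */
#define DATA_ALIGN(n) ((n) <= 8 ? 1 : ((n) <= 16 ? 2 : 4))

#define OP_DATA 1                    /* hypothetical operand class */
#define DATA_OPERAND(n) OP_DATA

typedef struct {                     /* hypothetical layout */
    int type;
    taddr value;
} operand;

typedef struct {                     /* hypothetical layout */
    uint16_t opcode;
} mnemonic_extension;

/* Hypothetical helper: store a 16-bit value in target byte order. */
static void put16(uint8_t *dst, uint16_t v)
{
    if (BIGENDIAN) {                 /* evaluated at run time */
        dst[0] = (uint8_t)(v >> 8);
        dst[1] = (uint8_t)(v & 0xff);
    } else {
        dst[0] = (uint8_t)(v & 0xff);
        dst[1] = (uint8_t)(v >> 8);
    }
}
```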

   Optional features, which can be enabled by defining the following
macros:

‘#define HAVE_INSTRUCTION_EXTENSION 1’
     If cpu-specific data should be added to all instruction atoms.

‘typedef ... instruction_ext;’
     Type for the above extension.

‘#define NEED_CLEARED_OPERANDS 1’
     Backend requires a zeroed operand structure when calling
     ‘parse_operand()’ for the first time.  Defaults to undefined.

‘START_PARENTH(x)’
     Valid opening parenthesis for instruction operands.  Defaults to
     ‘'('’.

‘END_PARENTH(x)’
     Valid closing parenthesis for instruction operands.  Defaults to
     ‘')'’.

‘#define MNEMONIC_VALID(i)’
     An optional function with the arguments ‘(int idx)’.  Returns true
     when the mnemonic with index ‘idx’ is valid for the current state
     of the backend (e.g.  it is available for the selected cpu
     architecture).

‘#define MNEMOHTABSIZE 0x4000’
     You can optionally override the default hash table size defined in
     ‘vasm.h’.  May be necessary for larger mnemonic tables.

‘#define OPERAND_OPTIONAL(p,t)’
     When defined, this is a function with the arguments ‘(operand
     *op,int type)’, which returns true when the given operand type
     (‘type’) is optional.  The function is only called for missing
     operands and should also initialize ‘op’ with default values (e.g.
     0).
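
   Two of these optional hooks could be implemented roughly as shown
below.  The operand layout, the operand types ‘OP_REG’ and ‘OP_IMM’, the
cpu flags and the table ‘mnemo_cpus’ are all assumptions made for this
sketch, not part of ‘vasm’.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {                     /* hypothetical operand layout */
    int type;
    long value;
} operand;

enum { OP_REG = 1, OP_IMM = 2 };     /* hypothetical operand types */
enum { CPU_BASE = 1, CPU_EXT = 2 };  /* hypothetical cpu flags */

/* One entry per mnemonic index: the cpus providing that mnemonic. */
static const unsigned mnemo_cpus[] = {
    CPU_BASE, CPU_BASE | CPU_EXT, CPU_EXT
};
static unsigned cpu_type = CPU_BASE; /* currently selected cpu */

#define MNEMONIC_VALID(i) ((mnemo_cpus[(i)] & cpu_type) != 0)

/* A missing immediate operand is allowed and defaults to 0. */
static int operand_optional(operand *op, int type)
{
    assert(op != NULL);
    if (type == OP_IMM) {
        op->type = OP_IMM;
        op->value = 0;
        return 1;
    }
    return 0;
}
#define OPERAND_OPTIONAL(p,t) operand_optional((p),(t))
```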

   Implementing additional target-specific unary operations is done by
defining the following optional macros:

‘#define EXT_UNARY_NAME(s)’
     Should return True when the string in ‘s’ points to an operation
     name we want to handle.

‘#define EXT_UNARY_TYPE(s)’
     Returns the operation type code for the string in ‘s’.  Note that
     the last valid standard operation is defined as ‘LAST_EXP_TYPE’, so
     the target-specific types will start with ‘LAST_EXP_TYPE+1’.

‘#define EXT_UNARY_EVAL(t,v,r,c)’
     Defines a function with the arguments ‘(int t, taddr v, taddr *r,
     int c)’ to handle the operation type ‘t’ returning an ‘int’ to
     indicate whether this type has been handled or not.  Your operation
     will be applied to the value ‘v’ and the result is stored in ‘*r’.
     The flag ‘c’ is passed as 1 when the value is constant (no
     relocatable addresses involved).

‘#define EXT_FIND_BASE(b,e,s,p)’
     Defines a function with the arguments ‘(symbol **b, expr *e,
     section *s, taddr p)’ to save a pointer to the base symbol of
     expression ‘e’ into the symbol pointer, pointed to by ‘b’.  The
     type of this base is given by an ‘int’ return code.  Further on,
     ‘e->type’ has to be checked to be one of the operations to handle.
     The section pointer ‘s’ and the current pc ‘p’ are needed to call
     the standard ‘find_base()’ function.
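
   The following sketch shows how the three ‘EXT_UNARY’ macros could
work together for two made-up operators, ‘<’ (low byte) and ‘>’ (high
byte).  The value given to ‘LAST_EXP_TYPE’ is only a placeholder so that
the example compiles on its own, and the names ‘LOBYTE’, ‘HIBYTE’ and
‘ext_unary_eval()’ are assumptions.  For simplicity the sketch ignores
the flag ‘c’.

```c
#include <stdint.h>

typedef int32_t taddr;

#ifndef LAST_EXP_TYPE
#define LAST_EXP_TYPE 100            /* placeholder, defined by vasm */
#endif

/* Target-specific types start behind the last standard type. */
enum { LOBYTE = LAST_EXP_TYPE + 1, HIBYTE };

#define EXT_UNARY_NAME(s) (*(s) == '<' || *(s) == '>')
#define EXT_UNARY_TYPE(s) (*(s) == '<' ? LOBYTE : HIBYTE)

/* Returns non-zero when the operation type 't' has been handled. */
static int ext_unary_eval(int t, taddr v, taddr *r, int c)
{
    (void)c;                         /* constant flag unused here */
    switch (t) {
    case LOBYTE:
        *r = v & 0xff;
        return 1;
    case HIBYTE:
        *r = (v >> 8) & 0xff;
        return 1;
    default:
        return 0;                    /* not one of our operations */
    }
}
#define EXT_UNARY_EVAL(t,v,r,c) ext_unary_eval((t),(v),(r),(c))
```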

5.2 The file ‘cpu.c’
--------------------

A cpu module has to provide the following elements (all other functions
and data should be ‘static’ to prevent name clashes) in ‘cpu.c’:

‘int bitsperbyte;’
     The number of bits per byte of the target cpu.

‘int bytespertaddr;’
     The number of bytes per ‘taddr’.

‘mnemonic mnemonics[];’
     The mnemonic table keeps a list of mnemonic names and operand types
     the assembler will match against using ‘parse_operand()’.  It may
     also include a target specific ‘mnemonic_extension’.

‘char *cpu_copyright;’
     A string that will be emitted as part of the copyright message.

‘char *cpuname;’
     A string describing the target cpu.

‘int init_cpu();’
     Will be called during startup, after argument parsing.  Must return
     zero if initializations failed, non-zero otherwise.

‘int cpu_args(char *);’
     This function will be called with the command line arguments
     (unless they were already recognized by other modules).  If an
     argument was recognized, return non-zero.

‘char *parse_cpu_special(char *);’
     This function will be called with a source line as argument and
     allows the cpu module to handle cpu-specific directives etc.
     Functions like ‘eol()’ and ‘skip()’ from the syntax module should
     be used to keep the syntax consistent.

‘operand *new_operand();’
     Allocate and initialize a new operand structure.

‘int parse_operand(char *text,int len,operand *out,int requires);’
     Parses the source at ‘text’ with length ‘len’ to fill the target
     specific operand structure pointed to by ‘out’.  Returns ‘PO_MATCH’
     when the operand matches the operand-type passed in ‘requires’ and
     ‘PO_NOMATCH’ otherwise.  When the source is definitely identified
     as garbage, the function may return ‘PO_CORRUPT’ to tell the
     assembler that it is useless to try matching against any other
     operand types.  Another special case is ‘PO_SKIP’, which is also a
     match, but skips the next operand from the mnemonic table (because
     it was already handled together with the current operand).

‘taddr instruction_size(instruction *ip, section *sec, taddr pc);’
     Returns the size of the instruction ‘ip’ in bytes, which must be
     identical to the number of bytes written by ‘eval_instruction()’
     (see below).

‘dblock *eval_instruction(instruction *ip, section *sec, taddr pc);’
     Converts the instruction ‘ip’ into a DATA atom, including
     relocations, if necessary.

‘dblock *eval_data(operand *op, taddr bitsize, section *sec, taddr pc);’
     Converts a data operand into a DATA atom, including relocations.

‘void init_instruction_ext(instruction_ext *);’
     (If ‘HAVE_INSTRUCTION_EXTENSION’ is set.)  Initialize an
     instruction extension.

‘char *parse_instruction(char *,int *,char **,int *,int *);’
     (If ‘MAX_QUALIFIERS’ is greater than 0.)  Parses instruction and
     saves extension locations.

‘int set_default_qualifiers(char **,int *);’
     (If ‘MAX_QUALIFIERS’ is greater than 0.)  Saves pointers and
     lengths of default qualifiers for the selected CPU and returns the
     number of default qualifiers.  Example: for a M680x0 CPU this would
     be a single qualifier, called "w".  Used by ‘execute_macro()’.

‘cpu_opts_init(section *);’
     (If ‘HAVE_CPU_OPTS’ is set.)  Gives the cpu module the chance to
     write out ‘OPTS’ atoms with initial settings before the first atom
     is generated.

‘cpu_opts(void *);’
     (If ‘HAVE_CPU_OPTS’ is set.)  Apply option modifications from an
     ‘OPTS’ atom.  For example: change cpu type or optimization flags.

‘print_cpu_opts(FILE *,void *);’
     (If ‘HAVE_CPU_OPTS’ is set.)  Called from ‘print_atom()’ to print
     an ‘OPTS’ atom's contents.
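
   To give an idea of the return codes of ‘parse_operand()’, here is a
minimal sketch for an imaginary CPU with registers ‘r0’ to ‘r7’ and
decimal immediates written as ‘#n’.  The operand layout, the operand
types, the helper ‘parse_number()’ and the numeric values of the ‘PO_’
constants are placeholders chosen for this example.  A real backend
would normally evaluate a full expression instead of plain digits.

```c
#include <ctype.h>

/* Placeholder values, the real constants are defined by vasm. */
enum { PO_CORRUPT = -1, PO_NOMATCH = 0, PO_MATCH = 1, PO_SKIP = 2 };

enum { OP_REG = 1, OP_IMM = 2 };     /* hypothetical operand types */

typedef struct {                     /* hypothetical operand layout */
    int type;
    long value;
} operand;

/* Reads exactly 'len' decimal digits.  Returns 0 on any other input. */
static int parse_number(const char *s, int len, long *val)
{
    long v = 0;
    int i;

    if (len <= 0)
        return 0;
    for (i = 0; i < len; i++) {
        if (!isdigit((unsigned char)s[i]))
            return 0;
        v = v * 10 + (s[i] - '0');
    }
    *val = v;
    return 1;
}

int parse_operand(char *text, int len, operand *out, int requires)
{
    long v;

    if (len <= 0)
        return PO_CORRUPT;           /* empty text can never match */

    if (requires == OP_REG && text[0] == 'r' &&
        parse_number(text + 1, len - 1, &v) && v < 8) {
        out->type = OP_REG;
        out->value = v;
        return PO_MATCH;
    }
    if (requires == OP_IMM && text[0] == '#' &&
        parse_number(text + 1, len - 1, &v)) {
        out->type = OP_IMM;
        out->value = v;
        return PO_MATCH;
    }
    return PO_NOMATCH;               /* caller may try another type */
}
```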

6 Output modules
================

Output modules can be chosen at runtime rather than compile time.
Therefore, several output modules are linked into one vasm executable
and their structure differs somewhat from syntax and cpu modules.

   Usually, an output module for some object format ‘fmt’ should be
contained in a file ‘output_<fmt>.c’ (it may use/include other files if
necessary).  To automatically include this format in the build process,
the ‘make.rules’ has to be extended.  The module should be added to the
‘OBJS’ variable at the start of ‘make.rules’.  Also, a dependency line
should be added (see the existing output modules).

   An output module must only export a single function which will return
pointers to necessary data/functions.  This function should have the
following prototype:
     int init_output_<fmt>(
           char **copyright,
           void (**write_object)(FILE *,section *,symbol *),
           int (**output_args)(char *)
         );

   In case of an error, zero must be returned.  Otherwise, it should
perform all necessary initializations, return non-zero and return the
following output parameters via the pointers passed as arguments:

‘copyright’
     A pointer to the copyright string.

‘write_object’
     A pointer to a function emitting the output.  It will be called
     after the assembler has completed and will receive pointers to the
     output file, to the first section of the section list and to the
     first symbol in the symbol list.  See the section on general data
     structures for further details.

‘output_args’
     A pointer to a function checking arguments.  It will be called with
     all command line arguments (unless already handled by other
     modules).  If the output module recognizes an appropriate option,
     it has to handle it and return non-zero.  If it is not an option
     relevant to this output module, zero must be returned.

   Finally, a call to ‘init_output_<fmt>’ has to be added in the
‘init_output()’ function in ‘vasm.c’ (should be self-explanatory).
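
   A skeleton of such a module could look like the following sketch.
The format name ‘myfmt’, the option ‘-keeplocals’ and the four-byte
header are made up for the example, and the incomplete ‘section’ and
‘symbol’ types only stand in for the real vasm types so that the
fragment compiles on its own.

```c
#include <stdio.h>
#include <string.h>

typedef struct section section;      /* stand-ins for the vasm types */
typedef struct symbol symbol;

static char *copyright = "vasm myfmt output module (example)";
static int keep_locals;              /* set by a hypothetical option */

/* Emits the output file.  This sketch only writes a made-up header. */
static void write_output(FILE *f, section *sec, symbol *sym)
{
    (void)sec;
    (void)sym;
    fwrite("MYFM", 1, 4, f);
}

/* Returns non-zero when the argument was recognized and handled. */
static int output_args(char *p)
{
    if (!strcmp(p, "-keeplocals")) {
        keep_locals = 1;
        return 1;
    }
    return 0;
}

/* The only exported function of the module. */
int init_output_myfmt(char **cp,
                      void (**wo)(FILE *, section *, symbol *),
                      int (**oa)(char *))
{
    *cp = copyright;
    *wo = write_output;
    *oa = output_args;
    return 1;                        /* non-zero means success */
}
```

   An init function for a format which does not support the selected CPU
would return zero instead of filling in the pointers.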

   Some remarks:

   − Some output modules cannot handle all supported CPUs.
     Nevertheless, they have to be written in a way that they can be
     compiled.  If the code references CPU specifics, those parts have
     to be enclosed in ‘#ifdef VASM_CPU_MYCPU’ ...  ‘#endif’ or similar.

     Also, if the selected CPU is not supported, the init function
     should fail.

   − Error/warning messages can be emitted with the ‘output_error’
     function.  As all output modules are linked together, they have a
     common list of error messages in the file ‘output_errors.h’.  If a
     new message is needed, this file has to be extended (see the
     section on general data structures for details).

   − ‘vasm’ has a mechanism to specify rather complex relocations in a
     standard way (see the section on general data structures).  They
     can be extended with CPU specific relocations, but usually CPU
     modules will try to create standard relocations (sometimes several
     standard relocations can be used to implement a CPU specific
     relocation).  An output module should try to find appropriate
     relocations supported by the object format.  The goal is to avoid
     special CPU specific relocations as much as possible.

   Volker Barthelmann vb@compilers.de

