Home / Tech / AWK As A Major Systems Programming Language — Revisited

AWK As A Major Systems Programming Language — Revisited

AWK As A Major Systems Programming Language — Revisited

AWK As A Major Systems Programming Language — Revisited

Preface

I began this paper in 2013, and in 2015 despatched it out for overview to the folks
listed in a while. After incorporating feedback, I despatched it to Rik Farrow, the
editor of the USENIX journal ;login: to see if he would publish it.
He declined to take action, for fairly good causes.

The paper languished, forgotten, till early 2018 once I got here throughout it and
determined to shine it off, put it up on GitHub, and make it out there from
my house web page in HTML.

If you have an interest in language design and evolution normally, and in Awk
particularly, I hope you’ll take pleasure in studying this paper. If not, then why
are you bothering taking a look at it now?

Arnold Robbins
Nof Ayalon, ISRAEL
March, 2018

1 Introduction

At the March 1991 USENIX convention, Henry Spencer introduced a paper
entitled AWK As A Major Systems Programming Language. In it,
he described his experiences utilizing the unique model of awk
to write down two vital “systems” packages—a clone for an inexpensive
subset of the nroff formatter
1,
and a easy parser generator.

He described what awk did properly, in addition to what it didn’t, and
introduced a listing of issues that awk would wish to accumulate
in an effort to take the place of an inexpensive different to C for
techniques programming duties on Unix techniques.

In specific, awk lies about in the course of the spectrum
between C, which is “close to the metal,” and the shell, which is
fairly high-level. A language at this degree that’s helpful for
doing techniques programming may be very fascinating.

This paper evaluations Henry’s want checklist, and describes among the occasions that
have occurred within the Unix/Linux world since 1991.
It presents a case that gawk—GNU Awk—fills a lot of the
main wants Henry listed means again in 1991, after which describes the
creator’s opinion as to why different languages have efficiently crammed the
techniques programming position which awk didn’t. It discusses
how the present model of gawk might
lastly be capable of be part of the ranks of different widespread, highly effective, scripting
languages in widespread use right this moment, and ends off with some counter-arguments
and the creator’s responses to them.

Acknowledgements

Thanks to Andrew Schorr, Henry Spencer, Nelson H.F. Beebe, and Brian Kernighan
for reviewing an earlier draft of this paper.

2 That Was Then …

In this part we overview the state of the Unix world in
1991, in addition to the state of awk, after which checklist what Henry Spencer
noticed as lacking for awk.

2.1 The Unix World in 1991

Undoubtedly, many readers of this paper weren’t utilizing
computer systems in 1991, so this part supplies the context
by which Henry’s paper was written. In March of 1991:

2.2 What Awk Lacked In 1991

Here is a abstract of what was flawed with the awk image
in 1991. These are in the identical order as introduced Henry’s paper.
We qualify every problem in an effort to later talk about the way it has been addressed
over time.

  1. New awk was not extensively out there. Most Unix distributors
    nonetheless shipped solely outdated awk. (Here is the place he mentions that
    “the independently-available implementations both value substantial
    quantities of cash or include troublesome [sic] licenses.”) His level then
    was that for portability, awk packages needed to be restricted
    to outdated awk.

    This might be thought of a top quality of implementation problem, though
    it’s actually a “lack of available implementation” problem.

  2. There isn’t any solution to inform awk to begin matching all its
    patterns over once more towards the prevailing $zero.
    This is a language design problem.

  3. There isn’t any array task.
    (Language design problem.)

  4. Getting an error message out to straightforward error is tough.
    (Implementation problem.)

  5. There isn’t any exact language specification for awk.
    This results in gratuitous portability issues.
    This too is thus a top quality of implementation problem, in that with out
    a specification, it’s tough to supply uniform, top quality
    implementations.

  6. The present extensively out there implementation is sluggish; a a lot
    quicker implementation is required and the most effective factor of all could be
    an optimizing compiler.
    (Implementation problem.)

  7. There isn’t any awk-level debugger.
    (Support software or high quality of implementation problem.)

  8. There isn’t any awk-level profiler.
    (Support software or high quality of implementation problem.)

In personal e-mail, Henry added the next gadgets, saying
“there are a couple more things I’d add now, in hindsight.”
These are direct quotes:

  1. [I can’t believe I didn’t discuss this in the paper, because I was
    certainly aware of it then!] Lack of any handy mechanism for including
    libraries. When awk is being invoked from a shell file, the
    shell file can do substitutions or use a number of -f choices, however
    these are mechanisms outdoors the language, and never very handy ones.
    What’s actually needed is one thing such as you get in Python and so on., the place one
    little assertion up close to the highest says “prepare for this program to have
    the xyz library out there when it runs.”

  2. I believe it was Rob Pike who later stated (roughly): “It says one thing
    dangerous about Awk that in a language with built-in common expressions,
    you find yourself utilizing substr() so typically.” My paper did allude
    to the issue of discovering out the place one thing matched in
    old-awk packages, however even in new awk, what you get
    is a quantity that you just then must feed to substr(). The language
    may actually use some extra handy means of dissecting a string utilizing
    regexp matching. [Caveat: I have not looked lately at Gawk to see if
    it has one.]

The first of those is someplace between a language design and a
language implementation problem. The latter is a language design problem.

three … And This Is Now

Fast ahead to 2018. Where do issues stand?

three.1 What Awk Has Today

The state of the awk world is a lot better now.
In the identical order:

  1. New awk is the usual model of awk
    right this moment on GNU/Linux, BSD, and industrial Unix techniques.
    The one notable exception is Solaris, the place /usr/bin/awk
    remains to be the outdated one; on all different techniques, plain awk
    is a few model of recent awk.

  2. There stays no solution to inform awk to begin matching all its
    patterns over once more towards the prevailing $zero. Furthermore,
    this can be a characteristic that has not been referred to as for by the awk
    neighborhood, besides in Henry’s paper. (We do acknowledge that this would possibly
    be a helpful characteristic.)

  3. There continues to be no array task.
    However, this perform in gawk, which has arrays of arrays, can do
    the trick properly. It can also be environment friendly, since gawk makes use of
    reference counted strings internally:

    perform copy_array(dest, supply,   i, depend)
    
    
  4. Getting error messages out is less complicated. All fashionable techniques have
    a /dev/stderr particular file to which error messages
    could also be despatched straight.

  5. Perhaps most vital of all, with the
    POSIX customary,
    there’s a formal customary specification
    for awk. As with all formal requirements, it isn’t
    excellent. But it supplies a superb start line, as properly
    as chapter and verse to quote when explaining the conduct
    of a standards-compliant model of awk.

  6. There are quite a few freely out there implementations, with
    completely different licenses, such that everybody ought to have the ability to discover
    an acceptable one:

three.2 And What GNU Awk Has Today

The tougher of the standard of implementation points are addressed
by gawk. In specific:

  1. Beginning with model Four.zero in 2011, gawk supplies an
    awk-level debugger: dgawk, which is
    modeled after GDB.
    This is a full debugger, with breakpoints, watchpoints, single assertion
    stepping and expression analysis capabilities.

  2. gawk has supplied an awk-level assertion profiler for
    a few years (pgawk).
    Although there isn’t a direct correlation with CPU time used, the
    assertion degree profiler stays a robust software for understanding
    program conduct.

  3. Since model Four.zero, gawk has had an ‘@embrace’ facility
    whereby gawk goes and finds the named awk supply
    progrm. For for much longer it has looked for information specified with
    -f alongside the trail named by the AWKPATH atmosphere variable.
    The ‘@embrace’ mechanism additionally makes use of AWKPATH.

  4. In phrases of getting on the items of textual content matched by an everyday expression,
    gawk supplies an optionally available third argument to the match()
    perform. This argument is an array which gawk fills in with each
    the matched textual content for the total regexp and subexpressions, and index and size
    info to be used with substr(). gawk additionally supplies the
    gensub() normal substitution perform, an enhanced model of the
    cut up() perform, and the patsplit() perform for specifying
    contents as a substitute of separators utilizing a regexp.

With the Four.1 launch, all three variations (gawk,
pgawk, and dgawk) are merged right into a single
executable, significantly decreasing the required set up “footprint.”

While gawk has nearly all the time been quicker than Brian
Kernighan’s awk, current efficiency enhancements carry
it nearer to mawk’s efficiency degree (a byte-code based mostly
execution engine and inside enhancements in array indexing).

And gawk clearly has probably the most options of any model,
lots of which significantly improve the facility of the language.

three.three So Where Does Awk Stand?

Despite all the above, gawk is just not as widespread as different
scripting languages. Since 1991, we are able to level to 4 main scripting
languages which have loved, or presently take pleasure in, differing ranges of
recognition: PERL, tcl/tk, Python, and Ruby. We assume it’s truthful to say
that Python and Ruby are the preferred scripting languages within the
second decade of the 21st century.

Is awk, as we’ve described it up thus far, now able to
compete with the opposite languages? Not fairly but.

Four Key Reasons Why Other Languages Have Gained Popularity

In retrospect, it appears clear (at the very least to us!) that there are two
main causes that all the beforehand talked about languages have loved
vital recognition. The first is their extensibility.
The second is namespace administration.

One actually can’t attribute their recognition to improved syntax.
In the opinion of many, PERL and Ruby each endure from horrible syntax.
Tcl’s syntax is readable however nothing particular. Python’s syntax is elegant,
though barely uncommon. The level right here is that all of them differ enormously
in syntax, and none actually provides the clear sample–motion paradigm
that’s awk’s trademark, but they’re all widespread languages.

If not syntax, then what? We imagine that their recognition stems from
the truth that all of those languages are simply extensible. This is true
with each “modules” within the scripting language, and extra importantly,
with entry to C degree amenities through dynamic library loading.

Furthermore, these languages permit you to group associated capabilities and
variables into packages or modules: they allow you to handle the namespace.

awk, then again, has all the time been closed. An awk
program can’t even change its working listing, a lot much less open
a connection to an SQL database or a socket to a server on the
Internet someplace (though gawk can do the latter).

If one examines the variety of extensions out there for PERL on CPAN,
or for Python akin to PyQt or the Python tk bindings, it turns into
clear that extensibility is the true key to energy (and from there
to recognition).

Furthermore, in awk, all world variables and capabilities
share a single namespace. This prevents many good software program improvement
practices based mostly on the precept of data hiding.

To summarize: A cheap language definition, environment friendly implementations,
debuggers and profilers are mandatory however not enough for true energy.
The closing substances are extensibility and namespaces.

5 Filling The Extensibility Gap

With model Four.1, gawk (lastly) supplies an outlined C API
for extending the core language.

5.1 API Overview

The API makes it attainable to write down capabilities in C or C++ which might be
callable from an awk program as if the perform had been
written in awk. The most simple solution to assume
of those capabilities is as user-defined capabilities that occur to be
carried out in a unique language.

The API supplies the next amenities:

  • Structures that map awk string, numeric, and undefined values
    into C varieties that may be labored with.

  • Management of perform parameters, together with the flexibility to transform
    a parameter whose authentic kind is undefined, into an array. That is,
    there may be full call-by-reference for arrays. Scalars are handed by
    worth, in fact.

  • Access to the image desk. Extension capabilities can learn all awk
    variables, and create and replace new variables. As an preliminary, comparatively
    arbitrary design choice, extensions can’t replace particular variables akin to
    NR or NF, with the only exception of PROCINFO.

  • Full array administration, together with the flexibility to create arrays, and arrays
    of arrays, and the flexibility so as to add and delete components from an array. It
    can also be attainable to “flatten” an array into an information construction that
    makes it easy for C code to loop over all the weather of an array.

  • The capability to run a process when gawk exits. This is conceptually
    the identical because the C atexit() perform.

  • Hooks into the built-in I/O redirection mechanisms in gawk.
    In specific, there are separate amenities for enter redirections
    with getline and ‘<’, output redirections with
    print or printf and ‘>’ or ‘>>’, and two-way
    pipelines with gawk’s ‘|&’ operator.

5.2 Discussion

Considerable thought went into the design of the API.
The gawk documentation supplies a
full description of the API itself,
with examples (over 50 pages value!), in addition to
some dialogue of the objectives and design choices
behind the API (in an appendix).
The improvement was completed over the course of
a few 12 months and a half, along with the builders of xgawk,
a fork of gawk that added options that made utilizing extensions
simpler, and included an extension for processing XML information in a means that
match naturally with the sample–motion paradigm. While it might not be
excellent, the gawk builders really feel that it’s a good begin.

FIXME: Henry Spencer suggests including extra data on the API and
on the design choices.
I believe this paper is lengthy sufficient, and the total doc is kind of large.
It’d be exhausting to tug API doc into this paper in an inexpensive trend,
though it will be attainable to overview among the design choices.
Comments?

The main xgawk additions to the C code base have been merged
into gawk, and the extensions from that venture have been
rewritten to make use of the brand new API. As a consequence, the xgawk venture
builders renamed their venture gawkextlib, and the venture now
supplies solely extensions.5

It is notable that capabilities written in awk can do a quantity
of issues that extension capabilities can’t, akin to modify any
variables, do I/O, name awk built-in capabilities,
and name different user-defined capabilities.

While it will actually be attainable to offer APIs for all of those
options for extension capabilities, this gave the impression to be overkill. Instead,
the gawk builders took the view that extension capabilities
ought to present entry to exterior amenities, and supply communication
to the awk degree through perform parameters and/or world variables,
together with associative arrays, that are the one actual knowledge construction.

Consider a easy instance. The customary du program
can recursively stroll a number of arbitrary file hierarchies, name
stat() to retrieve file info, after which sum up the blocks
used. In the method, du should monitor exhausting hyperlinks, in order that no
file is accounted for or reported greater than as soon as.

The ‘filefuncs’ extension shipped with gawk supplies a
stat() perform that takes a pathname and fills in an associative
array with the knowledge retrieved from stat(). The array
components have names like "size", "mtime" and so forth, with
corresponding applicable values. (Compare this to PERL’s stat()
perform that returns a linearly-indexed array!)

The fts() perform within the ‘filefuncs’ extension builds on
stat() to create a multidimensional array of arrays that describes
the requested file hierarchies, with every component being an array crammed
in by stat(). Directories are arrays containing components for every
listing entry, with a component named "." for the array itself.

Given that fts() does the heavy lifting, du will be
written fairly properly, and fairly portably6, in awk. See du in awk, for the
code, which weighs in at below 250 traces. Much of that is feedback and
argument parsing.

5.three Future Work

The extension facility is comparatively new, and undoubtedly has launched new
“dark corners” into gawk. These stay to be uncovered
and any new bugs should be shaken out and eliminated.

Some points are recognized and might not be resolvable. For instance, 64-bit
integer values such because the timestamps in stat() knowledge on fashionable
techniques don’t match into awk’s 64-bit double-precision
numbers which solely have 53 bits of significand. This can also be a
downside for the bit-manipulation capabilities.

With respect to namespaces, in 2017 I (lastly) discovered how
namespaces in awk should work to offer the wanted
performance whereas retaining backwards compatibility.
The code is presently within the characteristic/namespaces department
of gawk’s Git repository. It will ultimately be merged
into the grasp department for launch as a part of gawk 5.zero.

FIXME: More data wanted right here.

6 Counterpoints

Brian Kernighan raised a number of counterpoints in response to
an earlier draft of the paper. They are value addressing (or
at the very least attempting to):

I’m not 100% satisfied by your fundamental premise, that the dearth of an
extension mechanism is the primary / an enormous motive why Awk isn’t used for
the sorts of system programming duties that Perl, Python, and so on., are.
It’s completely an element—with out such a mechanism, there’s simply no
solution to do quite a lot of vital computations. But how does that commerce off
towards simply having built-in mechanisms for the core system programming
amenities (as Perl does) or a handful of core libraries like sys,
os, regex, and so on., for Python?

I believe that Perl’s authentic inclusion of a lot of the Unix system calls
was, from a language design standpoint, finally a mistake. At
the time it was first completed, there was no different alternative: dynamic loading
of libraries didn’t exist on Unix techniques within the early and mid-1980s
(nor did shared libraries, for that matter). But having all these
built-in capabilities bloats the language, making it more durable to study,
doc, and preserve, and I positively didn’t want to go down that
path for gawk.

With respect to Python, the query is: how are these libraries
carried out? Are they built-in to the interpreter and separated from the
“core” language just by the language design? Or are they dynamically
loaded modules?

If the latter, that seems like an argument for the case of getting
extensions, not towards it. And certainly, this merely emphasizes the
level made on the finish of the earlier part, which is that to make an
extension facility actually scalable, you additionally want some type of namespace /
module functionality.

Thus, Brian is right: an extension facility is required, however the
final a part of the puzzle could be a module facility within the language.
I believe that I’ve solved this, and invite the curious reader to
checkout the department named earlier and supply suggestions.

I’m additionally not satisfied that Awk is the precise language for writing
issues that want extensions. It was initially designed for 1-liners,
and quite a lot of its constructs don’t scale as much as larger packages. The
notation for perform locals is appalling (all my fault too, which makes
it worse). There’s little likelihood to get well from random spelling
errors and typos; the usage of mere adjacency for concatenation seems to be
ever extra like a nasty thought.

This is tough to argue with. Nonetheless, gawk’s –lint
possibility could also be of assist right here, in addition to the –dump-variables
possibility which produces a listing of all variables utilized in this system.

Awk is ok for its authentic objective, however I discover myself writing Python
for something that’s going to be larger than say 10-20 traces except the
traces are principally simply longer pattern-action sequences. (That
notation is a win, in fact, which you level out.)

Since my Python expertise is minimal, I’ve little to say right here;
it may be that if I had been extra aware of Python, I might begin
utilizing it for small scripts as a substitute of awk.

On the opposite hand, with self-discipline, it’s attainable to write down pretty
good-sized, comprehensible and maintainable awk packages;
in my expertise awk does scale up properly past the one-liner
vary.

Not to say that Brian printed a complete guide of awk
packages bigger than one line. :-) (See the Resources part.)

Some of my very own, good-sized awk packages can be found
from GitHub:

The TexiWeb Jr. literate programming system

See https://github.com/arnoldrobbins/texiwebjr.
The suite has two packages that whole over 1,300 traces
of awk. (They share some code.)

Prepinfo

See https://github.com/arnoldrobbins/prepinfo.
This script processes Texinfo information, updating menus
as wanted. This model is rewritten in TexiWeb Jr.; it’s
about 350 traces of awk.

Sortmail

See https://github.com/arnoldrobbins/sortmail.
This script kinds a Unix mbox format mailbox by thread.
I exploit it every day. It’s additionally written in TexiWeb Jr. and
is about 330 traces of awk.

Brian continues:

The du instance is terribly large, although it does showcase among the
language options. Could you get the identical mileage with one thing
fairly a bit shorter?

My definition of “small” and “big” has modified over time. 250 traces
could also be large for a script, however the du.awk program is far smaller
than a full implementation in C: GNU du is over 1,100 traces
of C, plus all of the libraries it depends upon within the GNU Coreutils.

With respect to shorter examples, nothing springs to thoughts instantly.
However, gawk comes with a number of helpful extensions that
are value exploring, rather more than we’ve lined right here.

For instance, the readdir extension within the gawk
distribution causes gawk to learn directories and return one
report per listing entry in an easy-to-parse format:

$ gawk -lreaddir '' .
-| 2109292/mail.mbx/f
-| 2109295/awk-sys-prog.texi/f
-| 2100007/./d
-| 2100056/texinfo.tex/f
-| 2100055/cleanit/f
-| 2109282/awk-sys-prog.pdf/f
-| 2100009/du.awk/f
-| 2100010/.git/d
-| 2098025/../d
-| 2109294/ChangeLog/f

How cool is that?!? :-)

Also, the gawkextlib venture supplies some very attention-grabbing
extensions. Of specific curiosity are the XML and JSON extensions,
however there are a selection of others, and it’s value testing.

In quick, it’s too early to essentially inform. This is the start of
an experiment. I hope it is going to be a enjoyable journey for me, the opposite
gawk maintainers, and the bigger neighborhood of awk
customers.

7 Conclusion

It has taken for much longer than any awk fan would really like, however lastly,
GNU Awk fills in nearly all of the gaps listed by Henry Spencer for
awk to be actually helpful as a techniques programming language.

In addition, expertise from different widespread languages has proven that
extensibility and namespaces are the keys to true energy,
usability, and recognition.

With the discharge of gawk Four.1, we really feel that gawk
(and thus the Awk language) are actually nearly on par with the essential capabilities
of different widespread languages. With gawk 5.zero, we hope to actually
attain par.

Is it too late within the recreation?
If sufficient folks begin to write extensions for gawk,
then maybe awk will return to the scripting language limelight.
If not, then the gawk builders may have wasted a
lot of effort and time. (We hope not!) Time will inform.

For now although, we hope that this paper may have piqued your
curiosity, and that you’ll take the time to provide gawk
a recent look.

Appendix A Resources

  1. The AWK Programming Language Paperback,
    Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger.
    Addison-Wesley, 1988.
    ISBN-13: 978-0201079814, ISBN-10: 020107981X.

  2. Effective awk Programming, fourth version.
    Arnold Robbins.
    O’Reilly Media, 2015.
    ISBN-13: 978-1491904619, ISBN-10: 1491904615.

  3. Online model of the gawk documentation:
    http://www.gnu.org/software program/gawk/guide/.

  4. The gawkextlib venture:
    http://sourceforge.internet/tasks/gawkextlib/.


Appendix B Awk Code For du

Here is the du program, written in Awk.
Besides demonstrating the facility of the stat() and fts()
extensions and gawk’s multidimensional arrays,
it additionally exhibits the swap assertion and the built-in
bit manipulation capabilities and(), or(), and compl().

The output is just not similar to GNU du’s, since filenames are
not sorted. However, gawk’s built-in sorting amenities
ought to make sorting the output simple; we depart that because the
conventional “exercise for the reader.”

#! /usr/native/bin/gawk -f

# du.awk --- write POSIX du utility in awk.
# See http://pubs.opengroup.org/onlinepubs/9699919799/utilities/du.html
#
# Most of the heavy lifting is completed by the fts() perform within the "filefuncs"
# extension.
#
# We assume this conforms to POSIX, aside from the default block dimension, which
# is about to 1024. Following GNU requirements, set POSIXLY_CORRECT within the
# atmosphere to power 512-byte blocks.
#
# Arnold Robbins
# arnold@skeeve.com

@embrace "getopt"
@load "filefuncs"

BEGIN 

# utilization --- print a message and die

perform utilization()
-L] [file] ..." > "/dev/stderr"
    exit 1


# compute_scale --- compute the size issue for block dimension calculations

perform compute_scale(     stat_info, blocksize)

    stat(".", stat_info)

    if (! ("devbsize" in stat_info)) 
        printf("du.awk: you must be using filefuncs extension from gawk 4.1.1 or latern") > "/dev/stderr"
        exit 1
    

    # Use "devbsize", which is the items for the depend of blocks
    # in "blocks".
    blocksize = stat_info["devbsize"]
    if (blocksize > BLOCK_SIZE)
        SCALE = blocksize / BLOCK_SIZE
    else    # I can not actually think about this could be true
        SCALE = BLOCK_SIZE / blocksize


# islinked --- return true if a file has been seen already

perform islinked(stat_info,        machine, inode, ret)

    machine = stat_info["dev"]
    inode = stat_info["ino"]

    ret = ((machine, inode) in Files_seen)

    return ret


# file_blocks --- return variety of blocks if a file has not been seen but

perform file_blocks(stat_info,     machine, inode)

    if (islinked(stat_info))
        return zero

    machine = stat_info["dev"]
    inode = stat_info["ino"]

    Files_seen[device, inode]++

    return block_count(stat_info)   # delegate precise counting


# block_count --- return variety of blocks from a stat() consequence array

perform block_count(stat_info,     consequence)


# sum_dir --- knowledge on a single listing

perform sum_dir(listing, do_print,   i, sum, depend)


# simple_walk --- summarize directories --- print data per parameter

perform simple_walk(filedata, do_print,    i, sum, path)

    for (i in filedata) 
        if ("." in filedata[i])  else 
        printf("%dt%sn", sum, path)
    


# sum_walk --- summarize directories --- print data just for the highest set of directories

perform sum_walk(filedata)

    simple_walk(filedata, FALSE)


# top_walk --- knowledge on the primary arguments solely

perform top_walk(filedata)


# all_walk --- knowledge on each file

perform all_walk(filedata, i, sum, depend)
{
    for (i in filedata) 
        if ("." in filedata[i])  else             # common file
            if (! islinked(filedata[i]["stat"])) 
        
    
    return sum
}

About Agent

Check Also

What Jihadists Are Saying About the Coronavirus

What Jihadists Are Saying About the Coronavirus Jihadist teams are carefully following the unfold of …

Leave a Reply

Your email address will not be published. Required fields are marked *