User:Jorend/Python set usage

From MozillaWiki
Jump to: navigation, search

Abstract

A Set() constructor is proposed for a future edition of the ECMAScript programming language standard. Should the constructor take a single iterable argument or any number of arguments? One factor in such a design decision is usefulness. Variable arguments are nicer when the number of values to be added to the set is known at the call site, as in Set(1, 2, 3) vs. Set([1, 2, 3]). A single iterable argument is nicer when the number of values is not statically known, as in Set(names) vs. Set(...names). Which need is more common?

A script was used to search several Python codebases (including the Python standard library) for expressions that create nonempty sets. Of these expressions, the number of values to be added to the new set is statically apparent in less than half (41% in the standard library, 26.8% overall).

Background

The Python programming language offers a mutable, unordered set data structure, implemented using a hash table. There are several ways to create one:

  • Set literal syntax. {'426', '225', '226'} is a set of three values. {a, b} is a set containing both a and b; it is of size 2 unless it happens that a == b, in which case it is of size 1. Setting aside duplicates, the number of values in a set literal is always statically apparent.
  • Set comprehension syntax. Python has syntax based on the mathematical “set-builder” notation. In mathematics, the set of all ordered pairs of primes may be written: { <a, b> | a ∈ P, b ∈ P }. The corresponding Python would be prime_pairs = { (a, b) for a in P for b in P} though of course in Python, P would have to be finite or the program would never finish. The number of values in a set comprehension is not usually statically apparent in practice.
  • Calling the set constructor. The constructor takes an optional iterable argument which provides the values. The number of values in a call to set is sometimes statically apparent, as in set([SUCCESS, FAILURE]), and sometimes not, as in set(self.get_labels()).

Method

A copy of the Python standard library and a corresponding Python executable were obtained using the commands:

hg clone http://hg.python.org/cpython
cd cpython
./configure
make

(hg is Mercurial; it can be installed on most Linux systems using something like sudo apt-get install mercurial.)

(On Mac OS X Lion, configure with CC=gcc-apple-4.2 ./configure instead. Apple’s LLVM-based gcc that ships with Lion can’t build Python.)

The script below was then run in the Lib directory. The script searches the current directory for Python files. It disregards unit tests, as they typically contain very unusual code. It parses each other file using the Python parser and finds each set-creation expression. The expression set(), which creates an empty set, is disregarded; the rest are classified as one of the following:

  • A - statically known number of values (supporting variable arguments)
  • B - not statically known number of values (supporting a single iterable argument)
  • C - using a comprehension (also supporting a single iterable argument, assuming ECMAScript gets array comprehensions)
from __future__ import division
import os, ast, collections

def target_files(dir):
    for dirpath, dirnames, filenames in os.walk(dir):
        dirnames[:] = [d for d in dirnames if d not in ('test', 'tests', '_tests')]
        for f in filenames:
            if f.endswith('.py') and not f.endswith('Tests.py'):
                yield os.path.join(dirpath, f)

def main():
    counts = collections.defaultdict(int)
    loc = 0
    for filename in target_files('.'):
        with open(filename) as f:
            src = f.read()
        lines = src.splitlines()
        loc += len(lines)
        parse_tree = ast.parse(src, filename)

        def log(node, code):
            counts[code] += 1
            print('{0} {1}:{2}:{3}'.format(code, filename, node.lineno, lines[node.lineno-1]))

        for node in ast.walk(parse_tree):
            if isinstance(node, ast.Call):
                if isinstance(node.func, ast.Name) and node.func.id == 'set':
                    assert len(node.args) in (0, 1)
                    assert len(node.keywords) == 0
                    assert node.starargs is None
                    assert node.kwargs is None
                    if len(node.args) == 1:
                        arg = node.args[0]
                        if isinstance(arg, (ast.Str, ast.List, ast.Tuple)):
                            log(node, 'A')
                        elif isinstance(arg, (ast.ListComp, ast.SetComp, ast.GeneratorExp)):
                            log(node, 'C')
                        else:
                            log(node, 'B')
            elif isinstance(node, ast.Set):
                log(node, 'A')
            elif isinstance(node, ast.SetComp):
                log(node, 'C')

    total = sum(counts.values())
    print()
    for t, n in sorted(counts.items()):
        print("Type {0}: {1} ({2:.1%})".format(t, n, n / total))
    print("total: {0}".format(total))
    print("set expressions per 1000 lines of code: {0}".format(1000 * total / loc))

main()

Results

The full output from parsing the Python standard library is shown below. There are 117 qualifying set-creation expressions in the standard library. Of these:

  • 48 (41.0%) are type A (supporting a constructor with variable arguments),
  • 59 (50.4%) are type B (supporting a constructor with a single iterable argument), and
  • 10 (8.5%) are type C (comprehensions, supporting a constructor with a single iterable argument)

The numbers for all codebases are:

Codebase A B C total density
standard library 48 (41.0%) 59 (50.4%) 10 (8.5%) 117 0.45
Django 9 (13.6%) 36 (54.5%) 21 (31.8%) 66 0.65
Mercurial 23 (14.0%) 110 (67.1%) 31 (18.9%) 164 2.24
MoinMoin 21 (72.4%) 7 (24.1%) 1 (3.4%) 29 0.76
scons 2 (28.6%) 5 (71.4%) 0 7 0.11
Total 103 (26.8%) 217 (56.6%) 63 (16.4%) 383 -

(The density column tells how many qualifying set-expressions each project has per 1000 lines of code.)

(I also ran the analysis on Zope, but it apparently does not use sets at all.)

A ./Lib/_markupbase.py:152:        if sectName in {"temp", "cdata", "ignore", "include", "rcdata"}:
A ./Lib/_markupbase.py:155:        elif sectName in {"if", "else", "endif"}:
A ./Lib/_markupbase.py:210:                if name not in {"attlist", "element", "entity", "notation"}:
A ./Lib/_markupbase.py:129:                elif decltype in {"attlist", "linktype", "link", "element"}:
B ./Lib/_pyio.py:158:    modes = set(mode)
A ./Lib/_pyio.py:159:    if modes - set("axrwb+tU") or len(mode) > len(modes):
C ./Lib/_weakrefset.py:175:        return self.data <= set(ref(item) for item in other)
C ./Lib/_weakrefset.py:182:        return self.data >= set(ref(item) for item in other)
C ./Lib/_weakrefset.py:187:        return self.data == set(ref(item) for item in other)
C ./Lib/abc.py:132:        abstracts = {name
A ./Lib/abc.py:189:        return any(cls.__subclasscheck__(c) for c in {subclass, subtype})
B ./Lib/ast.py:60:            return set(map(_convert, node.elts))
B ./Lib/bdb.py:22:        self.skip = set(skip) if skip else None
C ./Lib/cgi.py:627:        return list(set(item.name for item in self.list))
B ./Lib/cmd.py:287:        commands = set(self.completenames(*args))
C ./Lib/cmd.py:288:        topics = set(a[5:] for a in self.get_names()
A ./Lib/difflib.py:1221:            if tag in {'replace', 'delete'}:
A ./Lib/difflib.py:1224:            if tag in {'replace', 'insert'}:
A ./Lib/difflib.py:1305:        if any(tag in {'replace', 'delete'} for tag, _, _, _, _ in group):
A ./Lib/difflib.py:1314:        if any(tag in {'replace', 'insert'} for tag, _, _, _, _ in group):
A ./Lib/ftplib.py:931:    if treply[:3] not in {'125', '150'}: raise error_proto  # RFC 959
A ./Lib/ftplib.py:933:    if sreply[:3] not in {'125', '150'}: raise error_proto  # RFC 959
A ./Lib/ftplib.py:178:        if s[:5] in {'pass ', 'PASS '}:
A ./Lib/ftplib.py:228:        if c in {'1', '2', '3'}:
A ./Lib/ftplib.py:252:        if resp[:3] not in {'426', '225', '226'}:
A ./Lib/ftplib.py:570:        if resp[:3] in {'250', '200'}:
A ./Lib/ftplib.py:388:        if user == 'anonymous' and passwd in {'', '-'}:
A ./Lib/ftplib.py:819:            if resp[:3] not in {'426', '225', '226'}:
B ./Lib/hashlib.py:59:algorithms_guaranteed = set(__always_supported)
B ./Lib/hashlib.py:60:algorithms_available = set(__always_supported)
B ./Lib/inspect.py:994:    possible_kwargs = set(args + kwonlyargs)
B ./Lib/mailbox.py:1614:        flags = set(flags)
B ./Lib/mailbox.py:1110:            all_keys = set(self.keys())
B ./Lib/mailbox.py:1646:            flags = set(self.get_flags())
B ./Lib/mailbox.py:1733:            sequences = set(self.get_sequences())
B ./Lib/mailbox.py:1823:            labels = set(self.get_labels())
B ./Lib/mailbox.py:1547:            flags = set(self.get_flags())
B ./Lib/mailbox.py:1744:            sequences = set(self.get_sequences())
B ./Lib/mailbox.py:1836:            labels = set(self.get_labels())
B ./Lib/mailbox.py:1141:                for key in sorted(set(keys)):
B ./Lib/mailbox.py:1511:        self.set_flags(''.join(set(self.get_flags()) | set(flag)))
B ./Lib/mailbox.py:1511:        self.set_flags(''.join(set(self.get_flags()) | set(flag)))
B ./Lib/mailbox.py:1560:            flags = set(self.get_flags())
B ./Lib/mailbox.py:1636:        self.set_flags(''.join(set(self.get_flags()) | set(flag)))
B ./Lib/mailbox.py:1636:        self.set_flags(''.join(set(self.get_flags()) | set(flag)))
B ./Lib/mailbox.py:1669:            flags = set(self.get_flags())
B ./Lib/mailbox.py:1846:            labels = set(self.get_labels())
B ./Lib/mailbox.py:1516:            self.set_flags(''.join(set(self.get_flags()) - set(flag)))
B ./Lib/mailbox.py:1516:            self.set_flags(''.join(set(self.get_flags()) - set(flag)))
B ./Lib/mailbox.py:1568:            flags = set(self.get_flags())
B ./Lib/mailbox.py:1641:            self.set_flags(''.join(set(self.get_flags()) - set(flag)))
B ./Lib/mailbox.py:1641:            self.set_flags(''.join(set(self.get_flags()) - set(flag)))
B ./Lib/mailbox.py:1679:            flags = set(self.get_flags())
B ./Lib/mailbox.py:1757:            sequences = set(self.get_sequences())
A ./Lib/netrc.py:74:                    tt in {'', 'machine', 'default', 'macdef'}):
A ./Lib/nntplib.py:124:_LONGRESP = {
A ./Lib/pydoc.py:166:    if name in {'__builtins__', '__doc__', '__file__', '__path__',
A ./Lib/pydoc.py:2022:            if modname in {'test.badsyntax_pep3120', 'badsyntax_pep3120'}:
B ./Lib/shutil.py:204:        return set(ignored_names)
B ./Lib/site.py:84:    for m in set(sys.modules.values()):
A ./Lib/socket.py:233:_blocking_errnos = { EAGAIN, EWOULDBLOCK }
A ./Lib/socket.py:152:            if c not in {"r", "w", "b"}:
A ./Lib/sre_compile.py:27:_LITERAL_CODES = set([LITERAL, NOT_LITERAL])
A ./Lib/sre_compile.py:28:_REPEATING_CODES = set([REPEAT, MIN_REPEAT, MAX_REPEAT])
A ./Lib/sre_compile.py:29:_SUCCESS_CODES = set([SUCCESS, FAILURE])
A ./Lib/sre_compile.py:30:_ASSERT_CODES = set([ASSERT, ASSERT_NOT])
A ./Lib/sre_parse.py:22:DIGITS = set("0123456789")
A ./Lib/sre_parse.py:24:OCTDIGITS = set("01234567")
A ./Lib/sre_parse.py:25:HEXDIGITS = set("0123456789abcdefABCDEF")
A ./Lib/sre_parse.py:27:WHITESPACE = set(" \t\n\r\v\f")
A ./Lib/sre_parse.py:381:_PATTERNENDERS = set("|)")
A ./Lib/sre_parse.py:382:_ASSERTCHARS = set("=!<")
A ./Lib/sre_parse.py:383:_LOOKBEHINDASSERTCHARS = set("=!")
A ./Lib/sre_parse.py:384:_REPEATCODES = set([MIN_REPEAT, MAX_REPEAT])
B ./Lib/stringprep.py:19:b1_set = set([173, 847, 6150, 6155, 6156, 6157, 8203, 8204, 8205, 8288, 65279] + list(range(65024,65040)))
B ./Lib/stringprep.py:220:c22_specials = set([1757, 1807, 6158, 8204, 8205, 8232, 8233, 65279] + list(range(8288,8292)) + list(range(8298,8304)) + list(range(65529,65533)) + list(range(119155,119163)))
B ./Lib/stringprep.py:247:c6_set = set(range(65529,65534))
B ./Lib/stringprep.py:252:c7_set = set(range(12272,12284))
B ./Lib/stringprep.py:257:c8_set = set([832, 833, 8206, 8207] + list(range(8234,8239)) + list(range(8298,8304)))
B ./Lib/stringprep.py:262:c9_set = set([917505] + list(range(917536,917632)))
B ./Lib/subprocess.py:1280:                    fds_to_keep = set(pass_fds)
B ./Lib/sysconfig.py:724:                archs = tuple(sorted(set(archs)))
A ./Lib/tarfile.py:129:PAX_NAME_FIELDS = {"path", "linkpath", "uname", "gname"}
B ./Lib/trace.py:133:        self._mods = set() if not modules else set(modules)
B ./Lib/collections/abc.py:421:        return set(it)
B ./Lib/collections/abc.py:437:        return set(it)
C ./Lib/concurrent/futures/_base.py:192:        finished = set(
C ./Lib/concurrent/futures/_base.py:254:        done = set(f for f in fs
B ./Lib/concurrent/futures/_base.py:195:        pending = set(fs) - finished
B ./Lib/concurrent/futures/_base.py:256:        not_done = set(fs) - done
B ./Lib/concurrent/futures/_base.py:275:    return DoneAndNotDoneFutures(done, set(fs) - done)
A ./Lib/distutils/msvc9compiler.py:255:    interesting = set(("include", "lib", "libpath", "path"))
B ./Lib/distutils/util.py:153:                archs = tuple(sorted(set(archs)))
A ./Lib/idlelib/CodeContext.py:18:BLOCKOPENERS = set(["class", "def", "elif", "else", "except", "finally", "for",
A ./Lib/idlelib/EditorWindow.py:1614:    if (not keylist) or (macosxSupport.runningAsOSXApp() and eventname in {
C ./Lib/idlelib/MultiCall.py:127:        substates = list(set(state & x for x in states))
A ./Lib/lib2to3/fixer_util.py:167:consuming_calls = set(["sorted", "list", "set", "any", "all", "tuple", "sum",
A ./Lib/lib2to3/fixer_util.py:339:_def_syms = set([syms.classdef, syms.funcdef])
A ./Lib/lib2to3/fixer_util.py:382:_block_syms = set([syms.funcdef, syms.classdef, syms.trailer])
B ./Lib/lib2to3/main.py:220:    avail_fixes = set(refactor.get_fixers_from_package(fixer_pkg))
C ./Lib/lib2to3/main.py:221:    unwanted_fixes = set(fixer_pkg + ".fix_" + fix for fix in options.nofix)
A ./Lib/lib2to3/patcomp.py:35:    skip = set((token.NEWLINE, token.INDENT, token.DEDENT))
A ./Lib/lib2to3/refactor.py:60:        return set([pat.type])
A ./Lib/lib2to3/fixes/fix_dict.py:39:iter_exempt = fixer_util.consuming_calls | set(["iter"])
B ./Lib/lib2to3/fixes/fix_numliterals.py:25:        elif val.startswith('0') and val.isdigit() and len(set(val)) > 1:
B ./Lib/multiprocessing/managers.py:397:            self.id_to_obj[ident] = (obj, set(exposed), method_to_typeid)
B ./Lib/packaging/run.py:237:    for dist in set(opts['args']):
A ./Lib/packaging/compiler/msvc9compiler.py:243:    interesting = set(("include", "lib", "libpath", "path"))
B ./Lib/packaging/pypi/simple.py:138:        self._mirrors = set(mirrors)
B ./Lib/packaging/pypi/xmlrpc.py:184:        return [self._projects[name.lower()] for name in set(projects)]
B ./Lib/packaging/pypi/xmlrpc.py:95:                hidden_versions = set(all_versions) - set(existing_versions)
B ./Lib/packaging/pypi/xmlrpc.py:95:                hidden_versions = set(all_versions) - set(existing_versions)
A ./Lib/wsgiref/handlers.py:25:_is_request = {
B ./Lib/xml/etree/ElementTree.py:990:    HTML_EMPTY = set(HTML_EMPTY)
B ./Lib/xmlrpc/server.py:279:        methods = set(self.funcs.keys())
B ./Lib/xmlrpc/server.py:284:                methods |= set(self.instance._listMethods())
B ./Lib/xmlrpc/server.py:289:                methods |= set(list_public_methods(self.instance))

Type A: 48 (41.0%)
Type B: 59 (50.4%)
Type C: 10 (8.5%)