User:Jorend/Python set usage
Abstract
A Set()
constructor is proposed for a future edition of the ECMAScript programming language standard. Should the constructor take a single iterable argument or any number of arguments? One factor in such a design decision is usefulness. Variable arguments are nicer when the number of values to be added to the set is known at the call site, as in Set(1, 2, 3)
vs. Set([1, 2, 3])
. A single iterable argument is nicer when the number of values is not statically known, as in Set(names)
vs. Set(...names)
. Which need is more common?
A script was used to search the Python standard library for expressions that create nonempty sets. Of these expressions, the number of values in the new set is statically apparent in only 41% of the cases.
Background
The Python programming language offers a mutable, unordered set
data structure, implemented using a hash table. There are several ways to create one:
- Set literal syntax.
{'426', '225', '226'}
is a set of three values.{a, b}
is a set containing both a and b; it is of size 2 unless it happens thata == b
, in which case it is of size 1. Setting aside duplicates, the number of values in a set literal is always statically apparent.
- Set comprehension syntax. Python has syntax based on the mathematical “set-builder” notation. In mathematics, the set of all ordered pairs of primes may be written: { <a, b> | a ∈ P, b ∈ P }. The corresponding Python would be
prime_pairs = { (a, b) for a in P for b in P}
though of course in Python,P
would have to be finite or the program would never finish. The number of values in a set comprehension is not usually statically apparent in practice.
- Calling the
set
constructor. The constructor takes an optional iterable argument which provides the values. The number of values in a call toset
is sometimes statically apparent, as inset([SUCCESS, FAILURE])
, and sometimes not, as inset(self.get_labels())
.
Methodology
A copy of the Python standard library and a corresponding Python executable were obtained using the commands:
hg clone http://hg.python.org/cpython cd cpython ./configure make
(hg
is Mercurial; it can be installed on most Linux systems using something like sudo apt-get install mercurial
.)
(On Mac OS X Lion, configure with CC=gcc-apple-4.2 ./configure
instead. Apple’s LLVM-based gcc that ships with Lion can’t build Python.)
The script below was then run in the same directory. The script searches the Lib
directory for Python files. It disregards unit tests, as they typically contain very unusual code. It parses each other file using the Python parser and finds each set-creation expression. The expression set()
, which creates an empty set, is disregarded; the rest are classified as one of the following:
- A - statically known number of values (supporting variable arguments)
- B - not statically known number of values (supporting a single iterable argument)
- C - using a comprehension (also supporting a single iterable argument, assuming ECMAScript gets array comprehensions)
import os, ast, collections def target_files(dir): for dirpath, dirnames, filenames in os.walk(dir): dirnames[:] = [d for d in dirnames if d not in ('test', 'tests')] for f in filenames: if f.endswith('.py'): yield os.path.join(dirpath, f) def main(): counts = collections.defaultdict(int) for filename in target_files('./Lib'): with open(filename) as f: src = f.read() lines = src.splitlines() parse_tree = ast.parse(src, filename) def log(node, code): counts[code] += 1 print('{0} {1}:{2}:{3}'.format(code, filename, node.lineno, lines[node.lineno-1])) for node in ast.walk(parse_tree): if isinstance(node, ast.Call): if isinstance(node.func, ast.Name) and node.func.id == 'set': assert len(node.args) in (0, 1) assert len(node.keywords) == 0 assert node.starargs is None assert node.kwargs is None if len(node.args) == 1: arg = node.args[0] if isinstance(arg, (ast.Str, ast.List, ast.Tuple)): log(node, 'A') elif isinstance(arg, (ast.ListComp, ast.SetComp, ast.GeneratorExp)): log(node, 'C') else: log(node, 'B') elif isinstance(node, ast.Set): log(node, 'A') elif isinstance(node, ast.SetComp): log(node, 'C') total = sum(counts.values()) print() for t, n in sorted(counts.items()): print("Type {0}: {1} ({2:.1%})".format(t, n, n / total)) main()
Results
The full output is shown below. In short there are 117 qualifying set-creation expressions in the standard library. Of these:
- 48 (41%) are type A (supporting a constructor with variable arguments),
- 59 (50.4%) are type B (supporting a constructor with a single iterable argument), and
- 10 (8.5%) are type C (comprehensions, supporting a constructor with a single iterable argument)
A ./Lib/_markupbase.py:152: if sectName in {"temp", "cdata", "ignore", "include", "rcdata"}: A ./Lib/_markupbase.py:155: elif sectName in {"if", "else", "endif"}: A ./Lib/_markupbase.py:210: if name not in {"attlist", "element", "entity", "notation"}: A ./Lib/_markupbase.py:129: elif decltype in {"attlist", "linktype", "link", "element"}: B ./Lib/_pyio.py:158: modes = set(mode) A ./Lib/_pyio.py:159: if modes - set("axrwb+tU") or len(mode) > len(modes): C ./Lib/_weakrefset.py:175: return self.data <= set(ref(item) for item in other) C ./Lib/_weakrefset.py:182: return self.data >= set(ref(item) for item in other) C ./Lib/_weakrefset.py:187: return self.data == set(ref(item) for item in other) C ./Lib/abc.py:132: abstracts = {name A ./Lib/abc.py:189: return any(cls.__subclasscheck__(c) for c in {subclass, subtype}) B ./Lib/ast.py:60: return set(map(_convert, node.elts)) B ./Lib/bdb.py:22: self.skip = set(skip) if skip else None C ./Lib/cgi.py:627: return list(set(item.name for item in self.list)) B ./Lib/cmd.py:287: commands = set(self.completenames(*args)) C ./Lib/cmd.py:288: topics = set(a[5:] for a in self.get_names() A ./Lib/difflib.py:1221: if tag in {'replace', 'delete'}: A ./Lib/difflib.py:1224: if tag in {'replace', 'insert'}: A ./Lib/difflib.py:1305: if any(tag in {'replace', 'delete'} for tag, _, _, _, _ in group): A ./Lib/difflib.py:1314: if any(tag in {'replace', 'insert'} for tag, _, _, _, _ in group): A ./Lib/ftplib.py:931: if treply[:3] not in {'125', '150'}: raise error_proto # RFC 959 A ./Lib/ftplib.py:933: if sreply[:3] not in {'125', '150'}: raise error_proto # RFC 959 A ./Lib/ftplib.py:178: if s[:5] in {'pass ', 'PASS '}: A ./Lib/ftplib.py:228: if c in {'1', '2', '3'}: A ./Lib/ftplib.py:252: if resp[:3] not in {'426', '225', '226'}: A ./Lib/ftplib.py:570: if resp[:3] in {'250', '200'}: A ./Lib/ftplib.py:388: if user == 'anonymous' and passwd in {, '-'}: A ./Lib/ftplib.py:819: if resp[:3] not in {'426', '225', '226'}: B ./Lib/hashlib.py:59:algorithms_guaranteed = set(__always_supported) B ./Lib/hashlib.py:60:algorithms_available = set(__always_supported) B ./Lib/inspect.py:994: possible_kwargs = set(args + kwonlyargs) B ./Lib/mailbox.py:1614: flags = set(flags) B ./Lib/mailbox.py:1110: all_keys = set(self.keys()) B ./Lib/mailbox.py:1646: flags = set(self.get_flags()) B ./Lib/mailbox.py:1733: sequences = set(self.get_sequences()) B ./Lib/mailbox.py:1823: labels = set(self.get_labels()) B ./Lib/mailbox.py:1547: flags = set(self.get_flags()) B ./Lib/mailbox.py:1744: sequences = set(self.get_sequences()) B ./Lib/mailbox.py:1836: labels = set(self.get_labels()) B ./Lib/mailbox.py:1141: for key in sorted(set(keys)): B ./Lib/mailbox.py:1511: self.set_flags(.join(set(self.get_flags()) | set(flag))) B ./Lib/mailbox.py:1511: self.set_flags(.join(set(self.get_flags()) | set(flag))) B ./Lib/mailbox.py:1560: flags = set(self.get_flags()) B ./Lib/mailbox.py:1636: self.set_flags(.join(set(self.get_flags()) | set(flag))) B ./Lib/mailbox.py:1636: self.set_flags(.join(set(self.get_flags()) | set(flag))) B ./Lib/mailbox.py:1669: flags = set(self.get_flags()) B ./Lib/mailbox.py:1846: labels = set(self.get_labels()) B ./Lib/mailbox.py:1516: self.set_flags(.join(set(self.get_flags()) - set(flag))) B ./Lib/mailbox.py:1516: self.set_flags(.join(set(self.get_flags()) - set(flag))) B ./Lib/mailbox.py:1568: flags = set(self.get_flags()) B ./Lib/mailbox.py:1641: self.set_flags(.join(set(self.get_flags()) - set(flag))) B ./Lib/mailbox.py:1641: self.set_flags(.join(set(self.get_flags()) - set(flag))) B ./Lib/mailbox.py:1679: flags = set(self.get_flags()) B ./Lib/mailbox.py:1757: sequences = set(self.get_sequences()) A ./Lib/netrc.py:74: tt in {, 'machine', 'default', 'macdef'}): A ./Lib/nntplib.py:124:_LONGRESP = { A ./Lib/pydoc.py:166: if name in {'__builtins__', '__doc__', '__file__', '__path__', A ./Lib/pydoc.py:2022: if modname in {'test.badsyntax_pep3120', 'badsyntax_pep3120'}: B ./Lib/shutil.py:204: return set(ignored_names) B ./Lib/site.py:84: for m in set(sys.modules.values()): A ./Lib/socket.py:233:_blocking_errnos = { EAGAIN, EWOULDBLOCK } A ./Lib/socket.py:152: if c not in {"r", "w", "b"}: A ./Lib/sre_compile.py:27:_LITERAL_CODES = set([LITERAL, NOT_LITERAL]) A ./Lib/sre_compile.py:28:_REPEATING_CODES = set([REPEAT, MIN_REPEAT, MAX_REPEAT]) A ./Lib/sre_compile.py:29:_SUCCESS_CODES = set([SUCCESS, FAILURE]) A ./Lib/sre_compile.py:30:_ASSERT_CODES = set([ASSERT, ASSERT_NOT]) A ./Lib/sre_parse.py:22:DIGITS = set("0123456789") A ./Lib/sre_parse.py:24:OCTDIGITS = set("01234567") A ./Lib/sre_parse.py:25:HEXDIGITS = set("0123456789abcdefABCDEF") A ./Lib/sre_parse.py:27:WHITESPACE = set(" \t\n\r\v\f") A ./Lib/sre_parse.py:381:_PATTERNENDERS = set("|)") A ./Lib/sre_parse.py:382:_ASSERTCHARS = set("=!<") A ./Lib/sre_parse.py:383:_LOOKBEHINDASSERTCHARS = set("=!") A ./Lib/sre_parse.py:384:_REPEATCODES = set([MIN_REPEAT, MAX_REPEAT]) B ./Lib/stringprep.py:19:b1_set = set([173, 847, 6150, 6155, 6156, 6157, 8203, 8204, 8205, 8288, 65279] + list(range(65024,65040))) B ./Lib/stringprep.py:220:c22_specials = set([1757, 1807, 6158, 8204, 8205, 8232, 8233, 65279] + list(range(8288,8292)) + list(range(8298,8304)) + list(range(65529,65533)) + list(range(119155,119163))) B ./Lib/stringprep.py:247:c6_set = set(range(65529,65534)) B ./Lib/stringprep.py:252:c7_set = set(range(12272,12284)) B ./Lib/stringprep.py:257:c8_set = set([832, 833, 8206, 8207] + list(range(8234,8239)) + list(range(8298,8304))) B ./Lib/stringprep.py:262:c9_set = set([917505] + list(range(917536,917632))) B ./Lib/subprocess.py:1280: fds_to_keep = set(pass_fds) B ./Lib/sysconfig.py:724: archs = tuple(sorted(set(archs))) A ./Lib/tarfile.py:129:PAX_NAME_FIELDS = {"path", "linkpath", "uname", "gname"} B ./Lib/trace.py:133: self._mods = set() if not modules else set(modules) B ./Lib/collections/abc.py:421: return set(it) B ./Lib/collections/abc.py:437: return set(it) C ./Lib/concurrent/futures/_base.py:192: finished = set( C ./Lib/concurrent/futures/_base.py:254: done = set(f for f in fs B ./Lib/concurrent/futures/_base.py:195: pending = set(fs) - finished B ./Lib/concurrent/futures/_base.py:256: not_done = set(fs) - done B ./Lib/concurrent/futures/_base.py:275: return DoneAndNotDoneFutures(done, set(fs) - done) A ./Lib/distutils/msvc9compiler.py:255: interesting = set(("include", "lib", "libpath", "path")) B ./Lib/distutils/util.py:153: archs = tuple(sorted(set(archs))) A ./Lib/idlelib/CodeContext.py:18:BLOCKOPENERS = set(["class", "def", "elif", "else", "except", "finally", "for", A ./Lib/idlelib/EditorWindow.py:1614: if (not keylist) or (macosxSupport.runningAsOSXApp() and eventname in { C ./Lib/idlelib/MultiCall.py:127: substates = list(set(state & x for x in states)) A ./Lib/lib2to3/fixer_util.py:167:consuming_calls = set(["sorted", "list", "set", "any", "all", "tuple", "sum", A ./Lib/lib2to3/fixer_util.py:339:_def_syms = set([syms.classdef, syms.funcdef]) A ./Lib/lib2to3/fixer_util.py:382:_block_syms = set([syms.funcdef, syms.classdef, syms.trailer]) B ./Lib/lib2to3/main.py:220: avail_fixes = set(refactor.get_fixers_from_package(fixer_pkg)) C ./Lib/lib2to3/main.py:221: unwanted_fixes = set(fixer_pkg + ".fix_" + fix for fix in options.nofix) A ./Lib/lib2to3/patcomp.py:35: skip = set((token.NEWLINE, token.INDENT, token.DEDENT)) A ./Lib/lib2to3/refactor.py:60: return set([pat.type]) A ./Lib/lib2to3/fixes/fix_dict.py:39:iter_exempt = fixer_util.consuming_calls | set(["iter"]) B ./Lib/lib2to3/fixes/fix_numliterals.py:25: elif val.startswith('0') and val.isdigit() and len(set(val)) > 1: B ./Lib/multiprocessing/managers.py:397: self.id_to_obj[ident] = (obj, set(exposed), method_to_typeid) B ./Lib/packaging/run.py:237: for dist in set(opts['args']): A ./Lib/packaging/compiler/msvc9compiler.py:243: interesting = set(("include", "lib", "libpath", "path")) B ./Lib/packaging/pypi/simple.py:138: self._mirrors = set(mirrors) B ./Lib/packaging/pypi/xmlrpc.py:184: return [self._projects[name.lower()] for name in set(projects)] B ./Lib/packaging/pypi/xmlrpc.py:95: hidden_versions = set(all_versions) - set(existing_versions) B ./Lib/packaging/pypi/xmlrpc.py:95: hidden_versions = set(all_versions) - set(existing_versions) A ./Lib/wsgiref/handlers.py:25:_is_request = { B ./Lib/xml/etree/ElementTree.py:990: HTML_EMPTY = set(HTML_EMPTY) B ./Lib/xmlrpc/server.py:279: methods = set(self.funcs.keys()) B ./Lib/xmlrpc/server.py:284: methods |= set(self.instance._listMethods()) B ./Lib/xmlrpc/server.py:289: methods |= set(list_public_methods(self.instance)) Type A: 48 (41.0%) Type B: 59 (50.4%) Type C: 10 (8.5%)