Architecture¶
How Sitting Duck turns code into data¶
When you write SELECT * FROM read_ast('src/**/*.py'), several things happen:
- File discovery — glob patterns expand to a file list, deduplicated and sorted
- Language detection — each file's extension maps to a tree-sitter grammar
- Parsing — tree-sitter produces a concrete syntax tree (zero-copy, error-recovering)
- DFS traversal — the tree is walked in pre-order, producing one row per node with
node_idvalues that encode the traversal order - Semantic enrichment — each node gets a
semantic_typefrom the language adapter's type map, plusname,scope,qualified_name, and optional native context (signature_type,parameters,modifiers,annotations) - Streaming output — rows stream to DuckDB in 2048-row chunks, so large codebases never require the full AST in memory
The result is a flat SQL table where tree structure is encoded in three columns: parent_id (immediate parent), depth (tree level), and descendant_count (subtree size). This encoding makes subtree membership a simple integer comparison — d.node_id > a.node_id AND d.node_id <= a.node_id + a.descendant_count — which is what powers :has(), combinators, and scope resolution without recursive CTEs.
Component overview¶
┌──────────────────────────────────────────────────────┐
│ DuckDB Interface │
│ read_ast() · parse_ast() · ast_select() · macros │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Unified AST Backend │
│ File handling · streaming · context extraction │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Language Adapter Registry │
│ 27 adapters · auto-detection · type maps │
└──────────────────────────────────────────────────────┘
│
▼
┌──────────────────────────────────────────────────────┐
│ Tree-sitter Parsers │
│ Grammar-specific parsing · error recovery │
└──────────────────────────────────────────────────────┘
Why tree-sitter?¶
Tree-sitter parsers are incremental (fast re-parse on edits), error-recovering (produce a tree even for broken syntax), and grammar-driven (adding a language is adding a grammar, not writing a parser). Sitting Duck ships 27 grammars as git submodules, each producing the same TSTree interface regardless of language.
Why a flat table?¶
An AST is a tree, but SQL is relational. Sitting Duck flattens the tree into a table where tree structure is encoded via DFS pre-order numbering:
parent_id— the direct parent'snode_iddepth— distance from the root (0 for root)descendant_count— number of nodes in this node's subtree
This gives you O(1) subtree membership (node_id range check), O(depth) ancestor walks, and composability with standard SQL joins — without recursive CTEs or graph queries.
The semantic type system¶
Every AST node type from every language maps to a universal 8-bit semantic_type:
Byte layout: [ ss kk tt ll ]
ss (bits 6-7): super_kind — 4 top-level categories
kk (bits 4-5): kind — 4 sub-categories per super_kind
tt (bits 2-3): super_type — 4 variants per kind
ll (bits 0-1): refinement — language-specific (reserved)
This lets you write WHERE semantic_type = 'DEFINITION_FUNCTION' and match Python def, JavaScript function, Rust fn, Go func, Java methods, and C++ function definitions — all with the same predicate. The CSS selector alias .func maps to the same check.
See Semantic Types Reference for the full type table.
The scope system¶
Every node carries a scope STRUCT with precomputed shortcuts:
scope.current — nearest enclosing scope's node_id
scope.function — nearest enclosing function's node_id
scope.class — nearest enclosing class's node_id
scope.module — nearest enclosing module/namespace's node_id
scope.stack — full ancestor chain with semantic kinds (scope nodes only)
These are computed during the DFS traversal by maintaining a stack of scope-creating nodes. When a node has IS_SCOPE in its flags, it pushes onto the stack; when traversal leaves its subtree, it pops. Every node reads the stack to fill its scope fields.
This eliminates the range-join pattern that would otherwise be needed to answer "what function is this inside?" — a question that comes up in call-graph queries, scope resolution, and the :scope() CSS pseudo-class.
The CSS selector engine¶
ast_select parses CSS selectors using tree-sitter's own CSS grammar (sitting duck querying itself), then translates the parsed selector AST into SQL conditions. The engine is implemented entirely in SQL macros — no C++ selector logic.
The dispatch splits by selector root type:
- Simple selectors (type, class, id, attribute) — scan the AST with filter predicates
- Combinators (descendant, child, sibling, adjacent) — join the AST against itself using the DFS range check
- Pseudo-classes (:has, :not, :scope, :calls, etc.) — correlated subqueries against the AST
- Pseudo-elements (::callers, ::callees, ::parent) — post-filter navigation from matched nodes
See CSS Selector Syntax for the full selector reference.
For contributors¶
Implementation details for people working on the codebase:
| Area | Key files |
|---|---|
| Extension entry point | src/sitting_duck_extension.cpp |
| AST backend + output | src/unified_ast_backend.cpp, src/include/unified_ast_backend_impl.hpp |
| Language adapters | src/language_adapters/*.cpp, src/include/*_native_extractors.hpp |
| Type definitions | src/language_configs/*.def |
| Semantic type system | src/include/semantic_types.hpp, src/semantic_type_functions.cpp |
| CSS selector engine | src/sql_macros/css_selectors.sql |
| Scope resolution | src/sql_macros/scope_resolution.sql |
| Grammar patching | scripts/apply_grammar_patches.sh |
See Adding Languages for the full language-addition workflow and Contributing for development setup.