prefer longer doc comments when selecting commented symbols #1258

QuietMisdreavus · 2025-07-28T23:30:27Z

Bug/issue #, if applicable: rdar://156595902

Summary

When a symbol's information is loaded from multiple symbol graphs, Swift-DocC selects the documentation content from "the first Swift symbol with a doc comment, or the first symbol of any kind with a doc comment if no documented Swift symbol exists", where "first" depends on the ordering of a Dictionary's keys. Because the ordering of those keys is not guaranteed, a symbol's prose content can be completely different between runs of DocC, despite no changes in the tool or the content, if the comment differs between different platforms' symbol graphs.
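As a rough illustration of the old rule's order dependence, here is a minimal sketch. The `Candidate` type and `firstDocumented(in:)` function are hypothetical stand-ins, not DocC's real types, and an array is used so that the two iteration orders can be shown explicitly.

```swift
// Hypothetical stand-in for one symbol graph's view of a symbol; not DocC's real types.
struct Candidate {
    var platform: String
    var isSwift: Bool
    var docComment: String?
}

// The old rule as described above: the first documented Swift candidate,
// otherwise the first documented candidate of any language.
func firstDocumented(in candidates: [Candidate]) -> String? {
    let documented = candidates.filter { $0.docComment != nil }
    return (documented.first(where: \.isSwift) ?? documented.first)?.docComment
}

let macOS = Candidate(platform: "macOS", isSwift: true, docComment: "Short comment.")
let iOS = Candidate(platform: "iOS", isSwift: true, docComment: "A longer comment with more detail.")

// The same two candidates in a different iteration order select different prose,
// which is the situation that arises when the order comes from a Dictionary's keys.
let oneRun = firstDocumented(in: [macOS, iOS])
let anotherRun = firstDocumented(in: [iOS, macOS])
```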

This PR addresses this by changing the way that commented symbols are selected. There is now a prioritized list of sort criteria that decides which comment is selected for display:

1. Swift symbols get priority over non-Swift symbols, as before.
2. Then, pick the one with the longest comment.
3. When there are multiple comments with the same length, pick the one that sorts first lexicographically.
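The three criteria above can be sketched as a single comparison function. This is a sketch under assumptions, not the PR's implementation: `DocumentedCandidate`, `isPreferred(_:over:)`, and `selectComment(from:)` are illustrative names, and "length" is assumed here to mean the total character count across all comment lines.

```swift
// Hypothetical stand-in for a documented symbol from one symbol graph;
// the names here are illustrative, not DocC's real API.
struct DocumentedCandidate {
    var isSwift: Bool
    var commentLines: [String]

    // Assumption: "length" is the total number of characters across all lines.
    var totalCount: Int { commentLines.reduce(0) { $0 + $1.count } }
}

// Returns true when `lhs` should be preferred over `rhs`.
func isPreferred(_ lhs: DocumentedCandidate, over rhs: DocumentedCandidate) -> Bool {
    // 1. Swift symbols win over non-Swift symbols.
    if lhs.isSwift != rhs.isSwift { return lhs.isSwift }
    // 2. Longer comments win.
    let lhsLength = lhs.totalCount
    let rhsLength = rhs.totalCount
    if lhsLength != rhsLength { return lhsLength > rhsLength }
    // 3. Equal lengths: the lexicographically earlier comment wins.
    return lhs.commentLines.lexicographicallyPrecedes(rhs.commentLines)
}

func selectComment(from candidates: [DocumentedCandidate]) -> [String]? {
    // `min(by:)` returns the element that the predicate orders first,
    // which is the most preferred candidate regardless of input order.
    candidates.min(by: isPreferred)?.commentLines
}

let candidates = [
    DocumentedCandidate(isSwift: false, commentLines: ["A very long Objective-C comment with lots of detail."]),
    DocumentedCandidate(isSwift: true, commentLines: ["Short."]),
    DocumentedCandidate(isSwift: true, commentLines: ["A longer Swift comment."]),
]
let selected = selectComment(from: candidates)
let selectedReversed = selectComment(from: candidates.reversed())
```

Because every criterion depends only on the candidates themselves, the result is the same whichever order the candidates arrive in.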

This could introduce a performance regression when there are many symbol graphs with identical comments, and those comments are long. I've tried to mitigate this, and I've run a benchmark with a moderately-sized framework with several platforms' symbol graphs, and it didn't seem to affect the timing in a significant way:

┌──────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ Metric                                   │ Change          │ main                 │ current              │
├──────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Duration for 'convert-total-time'        │ no change¹      │ 1.176 sec            │ 1.166 sec            │
│ Duration for 'documentation-processing'  │ no change²,³    │ 0.367 sec            │ 0.351 sec            │
│ Duration for 'finalize-navigation-index' │ no change⁴,⁵    │ 0.011 sec            │ 0.011 sec            │
│ Peak memory footprint                    │ no change⁶      │ 343.9 MB             │ 344 MB               │
│ Data subdirectory size                   │ no change⁷      │ 12.6 MB              │ 12.6 MB              │
│ Index subdirectory size                  │ no change       │ 413 KB               │ 413 KB               │
│ Total DocC archive size                  │ no change⁸      │ 16.7 MB              │ 16.7 MB              │
│ Topic Anchor Checksum                    │ no change       │ d41d8cd98f00b204e980 │ d41d8cd98f00b204e980 │
│ Topic Graph Checksum                     │ no change       │ 566b7978f799c47e6113 │ 566b7978f799c47e6113 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Dependencies

None

Testing

As this is a nondeterminism bug, the issue at hand is the way that data may be ordered differently from build to build. Therefore, any test will need to be run many times to ensure that the ordering of declarations is as expected. The unit test added in this PR captures the nondeterministic behavior reasonably well; I have run into situations where a test run actually passed prior to the code change, but I ran into far more situations where the test failed due to the order being different, especially with a repetition count around 100.
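A minimal sketch of the repeat-many-times idea, assuming explicit shuffling as a stand-in for build-to-build ordering differences. The helper `isOrderIndependent` and the sample comments are hypothetical and are not the unit test added in this PR.

```swift
// Runs `select` over many random orderings of `inputs` and reports whether
// every ordering produced the same result. Shuffling is a stand-in for the
// build-to-build ordering differences; it is not how DocC's own test works.
func isOrderIndependent<Input, Output: Equatable>(
    _ inputs: [Input],
    repetitions: Int = 100,
    select: ([Input]) -> Output
) -> Bool {
    let expected = select(inputs)
    for _ in 0..<repetitions {
        if select(inputs.shuffled()) != expected { return false }
    }
    return true
}

let comments = ["Short.", "A longer comment.", "Medium one."]

// Order-dependent selection: a single run can pass by luck, but with
// 100 shuffles a mismatch is all but certain to show up.
let firstIsStable = isOrderIndependent(comments) { $0.first }

// Order-independent selection: longest wins, ties broken lexicographically.
let longestIsStable = isOrderIndependent(comments) { candidates in
    candidates.min { $0.count != $1.count ? $0.count > $1.count : $0 < $1 }
}
```

A low repetition count makes the order-dependent version pass more often, which matches the observation that the test occasionally passed before the fix.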

Checklist

Make sure you check off the following items. If they cannot be completed, provide a reason.

Added tests
Ran the ./bin/test script and it succeeded
[ n/a ] Updated documentation if necessary

rdar://156595902

QuietMisdreavus · 2025-07-28T23:31:15Z

@swift-ci Please test

d-ronnqvist · 2025-07-30T12:11:44Z

> This could introduce a performance regression when there are many symbol graphs with identical comments, and those comments are long. I've tried to mitigate this, and I've run a benchmark with a moderately-sized framework with several platforms' symbol graphs, and it didn't seem to affect the timing in a significant way:

Were the documentation comments different in the different symbol graph files? Otherwise I think the comparisons would all exit early without comparing the different lengths.

d-ronnqvist · 2025-07-30T12:04:08Z

Sources/SwiftDocC/Semantics/Symbol/UnifiedSymbol+Extensions.swift

+    fileprivate var fullText: String {
+        map(\.text).joined(separator: "\n")
+    }


minor: I don't see this being used anywhere.

Oh, this is probably a vestige of a previous iteration of the algorithm; I can take it out.

d-ronnqvist · 2025-07-30T12:08:46Z

Tests/SwiftDocCTests/Infrastructure/DocumentationContext/DocumentationContextTests.swift

+                    ])
+            ])
+
+        func runAssertions(forwards: Bool) throws {


FYI: if you want to run the same test assertions in different configurations you can also write a basic for loop (instead of an inner function) around the code:

for forwards in [true, false] {

Otherwise it's good for internal test functions to have a file and line argument that they pass to each XCTAssert... call so that a failure can be attributed to the specific configuration that caused it.
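A minimal sketch of the second suggestion. `expectEqual` is a hypothetical stand-in for `XCTAssertEqual` so that the snippet runs without XCTest, and the values being checked are made up; in a real test the inner function would pass `file` and `line` to each `XCTAssert...` call in the same way.

```swift
func runDemo() -> [UInt] {
    var failureLines: [UInt] = []

    // Stand-in for XCTAssertEqual: records the line a failure is attributed to.
    func expectEqual<T: Equatable>(
        _ actual: T, _ expected: T,
        file: StaticString = #filePath, line: UInt = #line
    ) {
        if actual != expected { failureLines.append(line) }
    }

    // The inner function defaults `file` and `line` to its call site and
    // forwards them, so a failure points at the configuration that caused it.
    func runAssertions(forwards: Bool, file: StaticString = #filePath, line: UInt = #line) {
        let values = forwards ? [1, 2, 3] : [3, 2, 1]
        expectEqual(values[0], 1, file: file, line: line)
        expectEqual(values.count, 3, file: file, line: line)
    }

    runAssertions(forwards: true)
    runAssertions(forwards: false) // the recorded failure is attributed to this line
    return failureLines
}

let recordedFailures = runDemo()
```

Without the forwarded arguments, both configurations would report failures at the same line inside `runAssertions`, and it would not be clear which call triggered them.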

d-ronnqvist · 2025-07-30T12:17:06Z

Sources/SwiftDocC/Semantics/Symbol/UnifiedSymbol+Extensions.swift

+            if lhs.value.lines.totalCount == rhs.value.lines.totalCount {
+                // if the comments are the same length, just sort them lexicographically
+                return lhs.value.lines.isLexicographicallyBefore(rhs.value.lines)
+            } else {
+                // otherwise, sort by the length of the doc comment,
+                // so that `min` returns the longest comment
+                return lhs.value.lines.totalCount > rhs.value.lines.totalCount
+            }


Question: which case is the more common? If it's somewhat common to have documentation comments of different lengths from different symbol graph files, then it might be worth not recomputing the lengths in the second comparison.

Suggested change

-            if lhs.value.lines.totalCount == rhs.value.lines.totalCount {
-                // if the comments are the same length, just sort them lexicographically
-                return lhs.value.lines.isLexicographicallyBefore(rhs.value.lines)
-            } else {
-                // otherwise, sort by the length of the doc comment,
-                // so that `min` returns the longest comment
-                return lhs.value.lines.totalCount > rhs.value.lines.totalCount
-            }
+            let lhsLength = lhs.value.lines.totalCount
+            let rhsLength = rhs.value.lines.totalCount
+            if lhsLength == rhsLength {
+                // if the comments are the same length, just sort them lexicographically
+                return lhs.value.lines.isLexicographicallyBefore(rhs.value.lines)
+            } else {
+                // otherwise, sort by the length of the doc comment,
+                // so that `min` returns the longest comment
+                return lhsLength > rhsLength
+            }

Even if it doesn't have a measurable impact on the scale of a full build, it's still unnecessarily repeated work.

It's probably way more common to have identical doc comments between different symbols than to have different ones. I do agree that precomputing the count feels nicer, though, so I can do that.

QuietMisdreavus · 2025-07-30T20:42:19Z

> Were the documentation comments different in the different symbol graph files? Otherwise I think the comparisons would all exit early without comparing the different lengths.

The test data I was using did have different doc comments between symbol graphs, so it should have been exercising the new code.

d-ronnqvist

LGTM

QuietMisdreavus requested review from d-ronnqvist and sofiaromorales July 28, 2025 23:30

QuietMisdreavus added 2 commits July 28, 2025 17:30

prefer longer doc comments when selecting commented symbols

ef4f610

rdar://156595902

attempt to mitigate the performance impact

383d13f

QuietMisdreavus force-pushed the deterministic-comments branch from ae22386 to 383d13f Compare July 28, 2025 23:31

d-ronnqvist reviewed Jul 30, 2025

d-ronnqvist approved these changes Jul 31, 2025
