Skip to content

Conversation

@poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Jan 14, 2026

Motivation

Background

  • Ledger states management
    • In BK server, ledgers are managed by HandleFactoryImpl, which maintains the ledger state.
    • When fencing entries, BK change the state of the Ledger Handle to fenced
    • When adding entries, BK checks fencing state first, then apply writing.
  • Open ledger with recovery
    • BK fence ledger first, then read LAC to ensure no entries missed
  • Threads design
    • Opening ledger with recovery works on the thread BookieHighPriorityThread
    • Adding entries works on the thread BookieWriteThreadPool

To avoid the following multi-threading competition issue,BK uses a lock[1]

(BookieWriteThreadPool)Add entry (BookieHighPriorityThread)Opening ledger with recovery
get LedgerDescriptor from HandleFactoryImpl
check fenced status: non-fenced
get LedgerDescriptor from HandleFactoryImpl
fence ledger
read Lac from write cache
add entry into write cache
add entry into Journal Disk
success
Lac is 0 Lac is -1
public void addEntry(ByteBuf entry, boolean ackBeforeSync, WriteCallback cb, Object ctx, byte[] masterKey)
            throws IOException, BookieException, InterruptedException {
        try {
            LedgerDescriptor handle = getLedgerForEntry(entry, masterKey);
            synchronized (handle) {
                if (handle.isFenced()) {
                    throw BookieException
                            .create(BookieException.Code.LedgerFencedException);
                }
                addEntryInternal(handle, entry, ackBeforeSync, cb, ctx, masterKey);
            }
        } 
}
synchronized CompletableFuture<Boolean> fenceAndLogInJournal(Journal journal) throws IOException {
    boolean success = this.setFenced();
    return logFenceEntryInJournal(journal);
}

Issue: there is a race condition breaks the lock above

https://github.com/apache/bookkeeper/blob/release-4.17.2/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/HandleFactoryImpl.java#L54-L68

    public LedgerDescriptor getHandle(final long ledgerId, final byte[] masterKey, boolean journalReplay)
            throws IOException, BookieException {
        LedgerDescriptor handle = ledgers.get(ledgerId);

        if (handle == null) {
            handle = LedgerDescriptor.create(masterKey, ledgerId, ledgerStorage);
            ledgers.putIfAbsent(ledgerId, handle);
        }

        handle.checkAccess(masterKey);
        return handle;
    }
(BookieWriteThreadPool)Add entry (BookieHighPriorityThread)Opening ledger with recovery
get LedgerDescriptor from HandleFactoryImpl get LedgerDescriptor from HandleFactoryImpl
check if created: false check if created: false
creates a new one creates a new one

Changes

Fix the bug

@zymap zymap requested a review from Copilot January 15, 2026 09:20
}
}
handle = LedgerDescriptor.create(masterKey, ledgerId, ledgerStorage);
ledgers.putIfAbsent(ledgerId, handle);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we can avoid the synchronization by handling the result of the putIfAbsent? I understand the main issue is that here we create two LedgerDescriptor, then used by different threads. But the putIfAbsent will reduce one. So we can handle the result to ensure the same handle is returned?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Improved, please review again

Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a race condition in HandleFactoryImpl.getHandle() that could lead to entry loss when multiple threads simultaneously create a LedgerDescriptor for the same ledger. The issue occurs when both the write thread and the recovery thread attempt to create a new descriptor concurrently, potentially breaking the synchronization lock that prevents fencing races.

Changes:

  • Implemented double-checked locking pattern in getHandle() to ensure only one LedgerDescriptor is created per ledger ID
  • Synchronized on the ledgers map during descriptor creation to prevent race conditions

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@poorbarcode poorbarcode requested a review from zymap January 16, 2026 03:03
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@poorbarcode
Copy link
Contributor Author

rerun failure checks

1 similar comment
@poorbarcode
Copy link
Contributor Author

rerun failure checks

Copy link
Contributor

@hangc0276 hangc0276 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job!

@hangc0276 hangc0276 closed this Jan 20, 2026
@hangc0276 hangc0276 reopened this Jan 20, 2026
@zymap zymap added this to the 4.18.0 milestone Jan 20, 2026
@zymap zymap merged commit 44607a0 into apache:master Jan 20, 2026
100 of 118 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants