Skip to content

Conversation

@dovgopoly
Copy link
Collaborator

@dovgopoly dovgopoly commented Jan 6, 2026

Changes

This PR refactors the bcli plugin from asynchronous to synchronous execution, addressing #8738.

Synchronous execution

  • Add run_bitcoin_cli function that blocks until bitcoin-cli completes.
  • Remove unused async infrastructure (pending queues, priority scheduling, retry callbacks, etc.).

bcli

  • estimatefees returns null feerates on failure
  • sendrawtransaction returns success: false with error message
  • getutxout returns null amount/script on failure
  • getrawblockbyheight returns "not found" on any getblockhash exit status (not just exit code 8).
  • getchaininfo failure, corrupted JSON or exhausted getrawblockbyheight retries return an error to lightningd, crashing with a clear error message.
  • Add synchronous retry logic for getrawblockbyheight using getblockfrompeer similar to async version.

lightningd

  • Add get_bitcoin_result in lightningd/bitcoind.c to check bcli responses for errors and extract the result token.
  • Provide clearer error messages when bcli returns an error (e.g., timeout after retries in getrawblockbyheight or invalid JSON).

Other improvements

  • Fix file descriptor leak (missing close(from))
  • Add tests for getblockfrompeer retry path and retry timeout behavior

plugins/bcli.c Outdated
res->output_len = strlen(res->output);

/* Wait for child to exit */
while (waitpid(child, &status, 0) < 0 && errno == EINTR);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be good to have something like this here:

	while (waitpid(child, &status, 0) < 0) {
		if (errno == EINTR)
			continue;
		plugin_err(plugin, "waitpid(%s) failed: %s",
			   res->args, strerror(errno));
	}

Copy link
Collaborator Author

@dovgopoly dovgopoly Jan 9, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, added this.

Also I think I forgot to close a from descriptor in wait_and_check_bitcoind and run_bitcoin_cliv. Added this as well.

* we can attempt to prevent a crash if the 'getchaininfo' function returns
* a lower height than the one we already know, by waiting for a short period.
* However, I currently don't have a better idea on how to handle this situation. */
u32 *height UNUSED;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to leave this as it is and not use the optional last_height parameter. We're planning on getting rid of retries altogether inside bcli and have lightningd solely deal with retrying. So it feels a bit clunky to me to add retrying based on the last_height and ALSO have it be blocking to the rest of the plugin. We were never using it before and I don't think anyone has complained about this (yet) so it's probably not fatal?

Comment on lines +1138 to +701
/* As of at least v0.15.1.0, bitcoind returns "success" but an empty
string on a spent txout. */
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't see this documented in the bitcoin-cli documentation https://developer.bitcoin.org/reference/rpc/gettxout.html , it might be good to test it out on cli to see if it every comes up with something else that's not an empty string but something like the string null?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That should be alright. I've found an issue raising this.

plugins/bcli.c Outdated
Comment on lines 496 to 504
static struct command_result *command_err_badjson(struct command *cmd,
struct bcli_result *res,
const char *errmsg)
{
char *err = tal_fmt(cmd, "%s: bad JSON: %s (%.*s)",
res->args, errmsg, (int)res->output_len, res->output);
return command_done_err(cmd, BCLI_ERROR, err, NULL);
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function is actually pretty good to have! Despite the plugin now being synchronous, it doesn't have to die, which plugin_err would do.

if (res->exitstatus != 0)
return estimatefees_null_response(cmd);

tokens = json_parse_simple(res->output, res->output, res->output_len);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this pattern quite absurd because res->output is not a ctx! But it's used all over bcli in other places so it's not your fault at all! Have asked @rustyrussell about this and if it is intentional.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asked him about this, this works okay in these scenarios because the lifetime of the pointers are the same so once the parent pointer is cleaned up any children are also cleaned up at the same time.

@dovgopoly dovgopoly force-pushed the feat/refactor-bcli branch 2 times, most recently from 2faf7ec to 8512a1d Compare January 9, 2026 22:31
@dovgopoly dovgopoly requested a review from sangbida January 9, 2026 22:33
@dovgopoly dovgopoly marked this pull request as ready for review January 10, 2026 15:25
@dovgopoly dovgopoly force-pushed the feat/refactor-bcli branch 6 times, most recently from abf4d82 to ae11557 Compare January 13, 2026 15:31
# We block l3 from seeing close, so it will try to reestablish.
def no_new_blocks(req):
return {"error": "go away"}
return {"error": {"code": -8, "message": "Block height out of range"}}
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we revert with another code, it kills lightningd instantly, not making retry anymore.
I think it's expected to return a null response to the lightningd here.

Copy link
Collaborator Author

@dovgopoly dovgopoly Jan 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After dddf995 it doesn't break the tests and we can remove these changes, but I decided to leave it as it is.

if (res->exitstatus == 8)
return getrawblockbyheight_notfound(cmd);
return command_err(cmd, res, "command failed");
return getrawblockbyheight_notfound(cmd);
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think returning a null response should be alright. I don't see the case we need to return error here. We can return an empty block and try again later. Tests are waiting the same behavior.

}
}

sleep(1);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that the sleep was there in the async implementation of the plugin but I would add a test that runs this under load. So try to getrawblockbyheight and also try to run bcli commands at the same time.

Comment on lines 114 to 132
/* Check if response is an error and fail with clear message if so */
static void check_bitcoin_error(struct bitcoind *bitcoind, const char *buf,
const jsmntok_t *toks, const char *method)
{
const jsmntok_t *err_tok, *msg_tok;

err_tok = json_get_member(buf, toks, "error");
if (!err_tok)
return;

msg_tok = json_get_member(buf, err_tok, "message");
if (msg_tok)
fatal("%s: %.*s", method,
json_tok_full_len(msg_tok), json_tok_full(buf, msg_tok));
else
fatal("%s: %.*s", method,
json_tok_full_len(err_tok), json_tok_full(buf, err_tok));
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the motivation behind this function. This seems like it kills bitcoind if an error response or a timeout happens from bitcoind. It's not uncommon for bitcoind (the actual bitcoin core daemon) to go out for lunch and have a timeout.

@rustyrussell checked his node and these were top 5 bitcoind timeouts in msec:
21812
22053
22406
23600
30535
60324

Separate from this, commands like estimatefees and getblockhash can fail when they are syncing. In our meeting, Rusty also suggested that it would be good to change that function from a check_bitcoin_error to something like check/get_bitcoin_result? That way you can do a lot of common result handling inside that function and as a result of that tidy up some of the code.

If we're doing this we should probably be a bit more discerning with our retry behaviour:

sendrawtransaction - This one is probably okay without a retry because we have rebroadcasting mechanisms already in lightningd.

getchaininfo - Lightningd retry loops rely on getchaininfo succeeding I think, so I wonder if this needs more internal retries or if it's okay for this to just die on one missed rpc call?

getrawblockbyheight - we have a retry on this one

estimatefees - This does fail during sync but if it returns the estimatefees_null_response() that's probably okay since it's not an error?

getutxoout - I think this one is fine as well because we only crash on a broken json.

plugins/bcli.c Outdated
}

strip_trailing_whitespace(res->output, res->output_len);
if (strlen(res->output) != 64)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: magic number

plugins/bcli.c Outdated

return command_still_pending(cmd);
/* Try fetching from peers if bitcoind >= 23.0.0 */
if (bitcoind->version >= 230000) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: magic number

const char *method,
...)
static LAST_ARG_NULL struct bcli_result *
run_bitcoin_cli(const tal_t *ctx,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can use this function to refactor wait_and_check_bitcoind

@dovgopoly dovgopoly force-pushed the feat/refactor-bcli branch 4 times, most recently from d7d9a93 to fb68ed9 Compare January 18, 2026 10:21
@sangbida sangbida requested a review from rustyrussell January 18, 2026 11:01
Add `command_err_badjson` helper for sync error handling, mirroring the async `command_err_bcli_badjson`. Store args string in `bcli_result` for consistent error messages.
Also rename command_err_badjson to generic command_err helper, since error messages aren't always about bad JSON (e.g., "command failed" for non-zero exit).
@dovgopoly dovgopoly changed the title Feat/refactor-bcli bcli: Refactor bcli plugin to synchronous execution Jan 19, 2026
dovgopoly and others added 6 commits January 19, 2026 13:30
Add `get_bitcoin_result` function that checks bcli plugin responses for errors and returns the result token. Previously, callbacks only detected errors when result parsing failed, ignoring the explicit error field from the plugin. Now we extract the actual error message from bcli, providing clearer reasoning when the plugin returns an error response.
Rewrite `test_bitcoin_failure` to reflect synchronous bcli behavior: the node now crashes on invalid bitcoind responses rather than retrying. Add `may_fail` and `broken_log` to handle expected crash.

Update `test_bitcoind_fail_first` stderr check to match the new error message format from `get_bitcoin_result`.

Update test mocks to use proper error format for "block not found".

Co-authored-by: ShahanaFarooqui <[email protected]>
Remove the asynchronous execution infrastructure no longer needed after converting all bcli commands to synchronous execution. This includes removing the async callbacks, the pending request queue, etc.

Fix missing `close(from)` file descriptor leak in `run_bitcoin_cliv`.

Changelog-Changed: bcli plugin now uses synchronous execution, simplifying bitcoin backend communication and improving error handling reliability.
Return "not found" on any `getblockhash` exit status. Previously, only exit code 8 (block height doesn't exist) returned "not found", while other exit codes returned an error. Now any non-zero exit status returns "not found" since any failure means the block is unavailable.
… shared execution

Extract `execute_bitcoin_cli` as shared function used by both `run_bitcoin_cli` and `wait_and_check_bitcoind`.
Add `test_bcli_concurrent` to verify bcli handles concurrent requests while the `getblockfrompeer` retry path is active, simulating a pruned node scenario where `getblock` initially fails.

Add `test_bcli_retry_timeout` to verify lightningd crashes with a clear error message when we run out of `getblock` retries.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants