2ndQuadrant is now part of EDB

Bringing together some of the world's top PostgreSQL experts.

2ndQuadrant | PostgreSQL
Mission Critical Databases
  • Contact us
  • EN
    • FR
    • IT
    • ES
    • DE
    • PT
  • Support & Services
  • Products
  • Downloads
    • Installers
      • Postgres Installer
      • 2UDA – Unified Data Analytics
    • Whitepapers
      • Business Case for PostgreSQL Support
      • Security Best Practices for PostgreSQL
    • Case Studies
      • Performance Tuning
        • BenchPrep
        • tastyworks
      • Distributed Clusters
        • ClickUp
        • European Space Agency (ESA)
        • Telefónica del Sur
        • Animal Logic
      • Database Administration
        • Agilis Systems
      • Professional Training
        • Met Office
        • London & Partners
      • Database Upgrades
        • Alfred Wegener Institute (AWI)
      • Database Migration
        • International Game Technology (IGT)
        • Healthcare Software Solutions (HSS)
        • Navionics
  • Postgres Learning Center
    • Webinars
      • Upcoming Webinars
      • Webinar Library
    • Whitepapers
      • Business Case for PostgreSQL Support
      • Security Best Practices for PostgreSQL
    • Blog
    • Training
      • Course Catalogue
    • Case Studies
      • Performance Tuning
        • BenchPrep
        • tastyworks
      • Distributed Clusters
        • ClickUp
        • European Space Agency (ESA)
        • Telefónica del Sur
        • Animal Logic
      • Database Administration
        • Agilis Systems
      • Professional Training
        • Met Office
        • London & Partners
      • Database Upgrades
        • Alfred Wegener Institute (AWI)
      • Database Migration
        • International Game Technology (IGT)
        • Healthcare Software Solutions (HSS)
        • Navionics
    • Books
      • PostgreSQL 11 Administration Cookbook
      • PostgreSQL 10 Administration Cookbook
      • PostgreSQL High Availability Cookbook – 2nd Edition
      • PostgreSQL 9 Administration Cookbook – 3rd Edition
      • PostgreSQL Server Programming Cookbook – 2nd Edition
      • PostgreSQL 9 Cookbook – Chinese Edition
    • Videos
    • Events
    • PostgreSQL
      • PostgreSQL – History
      • Who uses PostgreSQL?
      • PostgreSQL FAQ
      • PostgreSQL vs MySQL
      • The Business Case for PostgreSQL
      • Security Information
      • Documentation
  • About Us
    • About 2ndQuadrant
    • 2ndQuadrant’s Passion for PostgreSQL
    • News
    • Careers
    • Team Profile
  • Blog
  • Menu Menu
You are here: Home1 / Blog2 / Craig's PlanetPostgreSQL3 / Dev Corner: error context stack corruption
craig.ringer

Dev Corner: error context stack corruption

February 23, 2018/4 Comments/in Craig's PlanetPostgreSQL /by craig.ringer

PostgreSQL uses error context callbacks to allow code paths to annotate errors with additional information. For example, pl/pgsql uses them to add a CONTEXT message reporting the procedure that was executing at the time of the error.

But if you get it wrong when you use one in an extension or a patch to core, it can be quite hard to debug. I’d like to share some hints here for people involved in PostgreSQL’s C core and extensions.

Intro to error contexts

(If you know the postgres backend code, skip to the next heading).

Say you have a function like the following utterly contrived example:

void
my_func(bool do_it)
{
    if (!do_it)
    {
        elog(WARNING, "not doing it!");
        return;
    }

    do_the_thing();
}

and you want to report on errors that occur anywhere in it, even in code called by do_the_thing() that may be far away in different modules of PostgreSQL. So that you know that you reached that code via my_func() and what the value of do_it was.

You can add an error context callback, which pushes a function pointer + optional argument onto the head of a linked list of callbacks. The head is in a global error_context_stack. Typically the entries are stack-allocated, e.g.

struct my_func_ctx_arg
{
    bool do_it;
};
 
static void
my_func_ctx_callback(void *arg)
{
    struct my_func_ctx_arg *ctx_arg = arg;
    errcontext("during my_func(do_it=%d)", ctx_arg->do_it);
}

void
my_func(bool do_it)
{
    ErrorContextCallback myerrcontext;
    struct my_func_ctx_arg ctxinfo;

    ctxinfo.do_it = do_it;
    myerrcontext.callback = my_func_ctx_callback;
    myerrcontext.arg = &ctxinfo;
    myerrcontext.previous = error_context_stack;
    error_context_stack = &myerrcontext;

    if (!do_it)
    {
        elog(WARNING, "not doing it!");
        return;
    }

    do_the_thing();

    Assert(error_context_stack == &myerrcontext);
    error_context_stack = myerrcontext.previous;
}

It’s a bit verbose, but it gives you much more useful messages in important cases. For example

ERROR: relcache lookup for 2132 failed

isn’t that informative. But something like:

ERROR: relcache lookup for 2132 failed
CONTEXT: in my_extension_func(...) with user_callback_fn="user_func"

gives you a lot more of a hint about where to look.

There’s a bug

Fry from Futurama ponders: "Not sure if I should *facepalm* or *headdesk*"

A day in the life of a developer

OK, so we know what errcontext callbacks are. But when we added the above code, suddenly our postgres starts crashing… sometimes. Backtraces show that the crashes are usually in errfinish(), but in random and unpredictable places.

#0  0x0000000000000014 in ??
#1  0x000000000084bb88 in errfinish (dummy=) at elog.c:439
... some unrelated stack that doesn't mention my_func here ...

Much head scratching occurs. Valgrind is brought to bear, and maybe it complains about an invalid access in elog.c just before the crash, but says the pointed-to memory was not recently allocated or freed, and can’t really tell you anything more than the crash backtrace did.

You can see that the error context stack is mangled in gdb:

(gdb) set print pretty on
(gdb) p * error_context_stack
$4 = {
  previous = 0x4d430004, 
  callback = 0x4d430004, 
  arg = 0x18
}
(gdb) p *error_context_stack->previous
Cannot access memory at address 0x4d430004
(gdb)

but not why. The contents seem to vary randomly and are often null. Printing the memory around the pointer to error_context_stack doesn’t tend to reveal anything that jumps out at you. (Or didn’t to me, anyway; if you did more low level work and asm you might recognise it.)

When I present it like this, you can probably guess why. The problem is in my_func even though it doesn’t appear in any of the crashes, valgrind won’t report on it, etc. And it’s not directly in the new code, so in a larger or more complex function it might be harder to spot.

    if (!do_it)
        return; /*  <--------------------- HERE */

See, on this path we failed to pop the error context stack. So error_context_stack still points inside the now-popped stack frame of myfunc, at myerrcontext. It doesn't point at the code address of myfunc so gdb won't give you a hint like myfunc+12, it's just some random stack space.

If we call ereport or elog now, they'll still work fine because the popped stack's contents aren't immediately overwritten. They'll access released stack frame contents, but valgrind doesn't check for that and won't care or complain.

If at some later stage we make calls that use up that stack space again, we may (or may not, depending on layout details, what the calls are, etc) overwrite the pointed-to memory with something else. At which point if we call ereport or elog we might crash. Or hey, we might not, if whatever's pointed to doesn't fault when treated as instructions.

Especially in an optimised binary, the crash can be unpredictable and come much later than the creation of the problem.

gcc's -fstack-protector-all won't help you either, since there's no stack-overwrite happening. Just a pointer to invalid stack frame contents.

So I thought I'd make this a bit more google-able to save the next person some hassle.
I'm sure this would be blindingly obvious to many people. But in a decent sized code base with a fair few changes across multiple modules, it gets more challenging.

The fix

The fix is trivial once you know where the problem is: pop the error context stack or, preferably, restructure the function to have a single non-error exit path, e.g.

    if (do_it)
        do_the_thing();
    else
        elog(WARNING, "not doing it!");

Yeah, in this contrived example it's hard to imagine why you'd write it any other way in the first place. But single-return isn't always worth the code contortions in more complex logic. And in some places in Pg return may be masked by macros.

Plus, I'm sure you're not the one who wrote the problem code anyway? Right?

git blame buggy_file.c

.... dammit. Yes, you were, you just forgot. Past-me, you write terrible code and your breath smells of onions.

Prevention

I wonder if a static checker could be taught to detect this issue by looking for return-paths? Or, in fact, already does? Hints welcomed, especially for something that won't spew false positives.

Debugging

I edited src/include/pg_config_manual.h to enable USE_VALGRIND. Added this to elog.h:

extern void verify_errcontext_stack(void)

and this to elog.c:

void
verify_errcontext_stack(void)
{
    ErrorContextCallback *econtext;
    for (econtext = error_context_stack;
         econtext != NULL;
         econtext = econtext->previous)
    {
        Assert(econtext != NULL);
#ifdef USE_VALGRIND
        VALGRIND_CHECK_VALUE_IS_DEFINED(econtext);
        Assert(VALGRIND_CHECK_MEM_IS_ADDRESSABLE(econtext, sizeof(ErrorContextCallback)) == 0);
        VALGRIND_CHECK_VALUE_IS_DEFINED(econtext->previous);
        if (econtext->previous != NULL)
        {
            VALGRIND_CHECK_MEM_IS_ADDRESSABLE(econtext->previous, sizeof(ErrorContextCallback));
            VALGRIND_CHECK_VALUE_IS_DEFINED(econtext->previous->callback);
            VALGRIND_CHECK_VALUE_IS_DEFINED(econtext->previous->previous);
        }
#endif
        Assert(econtext->previous == NULL
               || econtext->previous->callback != NULL);
    }
}

then scattered calls to it around elog.c.

This helped detect the fault earlier. It proved important to also build with optimisation entirely disabled:

CFLAGS="-O0 -ggdb -g3" ./configure --prefix=/home/craig/pg/10 --enable-cassert --enable-debug --enable-tap-tests

With these two changes, crashes occurred much closer to the actual callsite, relatively shortly after the problem. If you scatter calls around, particularly before and after calls to any function that sets up an error context, it'll help you narrow things down quickly.

Tags: c, debugging, PostgreSQL, programming, valgrind
Share this entry
  • Share on Facebook
  • Share on Twitter
  • Share on WhatsApp
  • Share on LinkedIn
4 replies
  1. Dianne Skoll
    Dianne Skoll says:
    February 25, 2018 at 3:51 pm

    Couldn’t macros be created for this where unbalanced macros would cause compilation failure, similar to the usual implementations of pthread_cleanup_push and pthread_cleanup_pop?

    Reply
    • Dianne Skoll
      Dianne Skoll says:
      February 25, 2018 at 3:53 pm

      … though I suppose this wouldn’t fix this particular bug.

      Reply
      • craig.ringer
        craig.ringer says:
        February 26, 2018 at 8:20 am

        Right, it looks like you’re just talking about how the implementation of pthread_cleanup_push may include a { to open a new scope.

        This doesn’t stop the caller from returning from within this scope. Nor does it stop them matching a pthread_cleanup_push with a bare } by mistake due to an editing error.

        We’d really need something a bit smarter than is, AFAIK, offered by C compilers. Some sort of static assertion that the return path must come through a pair function of this call.

        Reply
    • craig.ringer
      craig.ringer says:
      February 26, 2018 at 3:29 am

      Right, it likely would, and we should adopt that. I was wondering about it, but unsure how to implement it.

      To guard against stray returns, etc, we’d probably need some sort of static checking annotations/helpers.

      Reply

Leave a Reply

Want to join the discussion?
Feel free to contribute!

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

Search

Get in touch with us!

Recent Posts

  • Random Data December 3, 2020
  • Webinar: COMMIT Without Fear – The Beauty of CAMO [Follow Up] November 13, 2020
  • Full-text search since PostgreSQL 8.3 November 5, 2020
  • Random numbers November 3, 2020
  • Webinar: Best Practices for Bulk Data Loading in PostgreSQL [Follow Up] November 2, 2020

Featured External Blogs

Tomas Vondra's Blog

Our Bloggers

  • Simon Riggs
  • Alvaro Herrera
  • Andrew Dunstan
  • Craig Ringer
  • Francesco Canovai
  • Gabriele Bartolini
  • Giulio Calacoci
  • Ian Barwick
  • Marco Nenciarini
  • Mark Wong
  • Pavan Deolasee
  • Petr Jelinek
  • Shaun Thomas
  • Tomas Vondra
  • Umair Shahid

PostgreSQL Cloud

2QLovesPG 2UDA 9.6 backup Barman BDR Business Continuity community conference database DBA development devops disaster recovery greenplum Hot Standby JSON JSONB logical replication monitoring OmniDB open source Orange performance PG12 pgbarman pglogical PG Phriday postgres Postgres-BDR postgres-xl PostgreSQL PostgreSQL 9.6 PostgreSQL10 PostgreSQL11 PostgreSQL 11 PostgreSQL 11 New Features postgresql repmgr Recovery replication security sql wal webinar webinars

Support & Services

24/7 Production Support

Developer Support

Remote DBA for PostgreSQL

PostgreSQL Database Monitoring

PostgreSQL Health Check

PostgreSQL Performance Tuning

Database Security Audit

Upgrade PostgreSQL

PostgreSQL Migration Assessment

Migrate from Oracle to PostgreSQL

Products

HA Postgres Clusters

Postgres-BDR®

2ndQPostgres

pglogical

repmgr

Barman

Postgres Cloud Manager

SQL Firewall

Postgres-XL

OmniDB

Postgres Installer

2UDA

Postgres Learning Center

Introducing Postgres

Blog

Webinars

Books

Videos

Training

Case Studies

Events

About Us

About 2ndQuadrant

What does 2ndQuadrant Mean?

News

Careers 

Team Profile

© 2ndQuadrant Ltd. All rights reserved. | Privacy Policy
  • Twitter
  • LinkedIn
  • Facebook
  • Youtube
  • Mail
PostgreSQL Maximum Table Size PostgreSQL Meltdown Benchmarks
Scroll to top
×