Re: [PATCH] Re: Proposal to Enable/Disable Index using ALTER INDEX - Mailing list pgsql-hackers
From | Robert Treat |
---|---|
Subject | Re: [PATCH] Re: Proposal to Enable/Disable Index using ALTER INDEX |
Date | |
Msg-id | CAJSLCQ3JLooLZBe-k=FGmJ-qYpAQB-ErnSPRo=6Ob4+Gy2CA4A@mail.gmail.com Whole thread Raw |
In response to | Re: [PATCH] Re: Proposal to Enable/Disable Index using ALTER INDEX (David Rowley <dgrowleyml@gmail.com>) |
Responses |
Re: [PATCH] Re: Proposal to Enable/Disable Index using ALTER INDEX
|
List | pgsql-hackers |
On Fri, Jun 6, 2025 at 8:04 PM David Rowley <dgrowleyml@gmail.com> wrote: > On Fri, 6 Jun 2025 at 14:32, Robert Treat <rob@xzilla.net> wrote: > > In production, you aren't watching to see what happen with pg_stat_all_indexes, because you will first be watching pg_stat_activityto see if the plans have flipped in some way that leads to an overloaded server (extra latency, poor cachingeffects, extra buffers usage, etc). And the replicated bit? Sadly someone launched some big DML operation so you'rewaiting for that to finish so the "quick rollback" can actually get to those other servers. > > I think you've misunderstood when you'd be looking at > pg_stat_all_indexes. The time when you'd want to look at > pg_stat_all_indexes is *before* you DROP INDEX and before you ALTER > TABLE INVISIBLE the index. What you'd likely want to look for there > are indexes that have the last_idx_scan set to something far in the > past or set to NULL. > I guess you have never heard of the TREAT method of index management? :-D - Test for duplicate indexes - Reindex bloated indexes - Eliminate unused indexes - Add missing indexes - Tune indexes for generic queries The easy part of figuring out what to change, the hard part (sometimes) is getting those changes into production safely; that's the part I am focused on. > I'm curious to know if you've ever had to drop an index out of > production before? What did you think about when you'd just typed the > DROP INDEX command and were contemplating your future? How long did > you pause before pressing [Enter]? > ROFL... Uh... yes, I have had to do it at least a few times. So, years ago I used to say things like "I wish we had a way to make indexes invisible like they do in Oracle" on the regular; but as I worked through several different implementations and their potential effects, and had more and more exposure to more demanding Postgres installations, my thinking evolved. I spoke with Sami a bit about this off-list and he walked me through some of the Oracle documentation on this (I had, at best, forgot the specifics), which I think was helpful to better understand some of the allure of the alter index/guc method for many people who are used to it (and this current version of the implementation is very Oracle like), but it also crystalized my feeling that an Oracle-style implementation would be a red herring that can keep us from a better solution. > Can you list your proposed series of steps you'd recommend to a DBA > wishing to remove an index, assuming this feature exists in core as > you'd like it to? > Well, the series of steps differs depending on the nature of the system being managed. If you are running on a single node with normal traffic and resources, you just set the GUC to include the index you want to be invisible, wait for a few days (maybe no one runs monthly reports on this system?), take a quick look at your monitoring/stats to make sure things seem copacetic, and then you drop the index and reset the GUC. But of course the people who I am most worried about are the ones who are operating on high scale, high transaction, high connection, "mission critical" systems... ie. people operating in high risk environments, where things can go very bad very fast. Where safety considerations are a critical part of every deployment. In that type of environment, the GUC-only method enables you to control changes at very precise levels, so you can do things like: - run it ad-hoc at the session level to confirm that the explain plans you get in production match your expectations. - you can stay ad-hoc at the session level and run explain analyze and confirm acceptable performance within your workload, and see what kind of buffer impact you are going to have (typically overlooked, but a potential landmine for outages, but I'll come back to this) - because we are operating at the session level, we can then add this on a per query basis at the application level, and in really high traffic scenarios, you can use canary releases and/or feature flags to ramp up those new queries into the live system. - depending on how much risk you are concerned about, you can use this session level method across queries individually, or at some point roll it up to a user/application level. And again, we can roll it out to different users at different times if you want. - at some point when you feel confident that you have covered enough angles, you set the GUC globally and let that marinate for a few more weeks as needed. And the funny thing is, at this point, once you have the guc put in globally, and it's run for some number of weeks or months and everyone is confident, you don't actually need the ALTER INDEX part any more; you can just drop the index and be done with it. Now of course if you aren't running at this kind of scale or don't have this level of risk, you can speed run this a bit and go directly to the user level or skip right to adding it globally, so the ease of use is on par with using ALTER. But in any case where you do have elevated levels of risk, this is actually less steps (and less risk) that having to use the ALTER/guc method. Earlier I mentioned the idea of monitoring buffer impact; let's talk about that. I often hear people say that you should be doing things like confirming your explain plans in development or have some type of staging system where you do these kind of "experiments", as if a test on a secondary system could really give you absolute confidence when deploying to a system that automatically updates its settings (ie pg_stats) at semi-random times with randomly sampled values; but in any case, most people will at least agree that there is no way to match up buffer usage across machines. That means if we are making production changes that might have a significant impact on buffers, we are doing something inherently dangerous. Well, dropping an index is one of those things. Imagine a scenario where you have a large index on a column and a similar partial index on the same column, which are both used in production for different queries, and therefore taking up some amount of space within the buffer pool. When you make the partial index invisible, the index is still maintained, and therefore it likely still needs to maintain pages within the buffer pool to stay updated. However, with queries now shifting to the full index, the full index may very well need to pull in additional pages into the buffer pool that it didn't need before, and this action can cause other pages from some unknown object to get evicted. If you are lucky, this all works itself and nothing bad happens, if you aren't, you may end up with a server overloaded by latency in queries that aren't even related to the indexes you're working on. (If you have a hard time seeing it with partial indexes, the same can happen with consolidating indexes with different INCLUDE statements, and certainly will be a scenario when people look to drop indexes by way of skip-scan based plans). Now, is it possible to handle this with the ALTER/guc method? Well, you can mitigate it somewhat, but ironically to do so requires pushing out the guc part of the ALTER/guc to all the places you would have pushed out the GUC-only method, and that has to have been done BEFORE running ALTER INDEX, so what does it really buy you? I suppose while we're here, let me also make some observations about how these methods differ when dealing with replica clusters. You mentioned that one of the things you liked about the ALTER/guc method is that it replicates the changes across all systems which makes it easy to revert, however I believe that thinking is flawed. For starters, any change that has to occur across the WAL stream is not something that can be relied on to happen quickly; there are too many other items that traverse that space that could end up blocking a rollback from being applied in a timely fashion. The more complex the replica cluster, the worse this is. One very common use case is to run different workloads on different nodes, with the ALTER/guc method, you are forcing users to make changes on a primary when they want to target a workload that only runs on a replica. This means I have to account for all potential workloads on all clusters before I can safely start making changes, and to the degree that the ALTER/guc gives me a safety net, that safety net is... to deploy a guc globally, one at a time, on each individual server. I feel like this email is already long, and tbh I could go on even more, but hopefully I've covered enough to help explain some of the issues that are involved here. I'm not trying to say that GUC-only is a perfect solution, but I do think it handles every use case on par with ALTER/guc, and enables some use cases ALTER/guc can't, especially for people who have to operate in risk-first environments. And I get it that some people are going to want a thing that looks very simple or is familiar to how Oracle did it, but I can't help but think this is one of those cases like how people used to always ask us to implement UPSERT because that's what MySQL had, but instead we gave them INSERT ON CONFLICT because it was the better solution to the problem they (actually) had. Robert Treat https://xzilla.net
pgsql-hackers by date: