<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Bugra Parlayan | Oracle Database &amp; Exadata Blog</title>
	<atom:link href="https://www.bugraparlayan.com.tr/feed" rel="self" type="application/rss+xml" />
	<link>https://www.bugraparlayan.com.tr</link>
	<description>A technical blog</description>
	<lastBuildDate>Sun, 28 Sep 2025 14:27:17 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.8.2</generator>

<image>
	<url>https://www.bugraparlayan.com.tr/wp-content/uploads/2020/06/cropped-plsql-32x32.jpg</url>
	<title>Bugra Parlayan | Oracle Database &amp; Exadata Blog</title>
	<link>https://www.bugraparlayan.com.tr</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Jira Service Management and OEM 24AI Test Fail</title>
		<link>https://www.bugraparlayan.com.tr/jira-service-management-and-oem-24ai.html</link>
		
		<dc:creator><![CDATA[Bugra Parlayan]]></dc:creator>
		<pubDate>Sun, 28 Sep 2025 14:24:29 +0000</pubDate>
				<category><![CDATA[Middleware]]></category>
		<category><![CDATA[Oracle Database]]></category>
		<guid isPermaLink="false">https://www.bugraparlayan.com.tr/?p=1526</guid>

					<description><![CDATA[<p>Oracle Enterprise Manager 24AI and Jira integration is explained clearly in the Oracle documents. If you see an authentication error during the integration, it is most likely related to the way you are doing the authentication. The documentation is for Jira Cloud version, so it uses the method Authorization: Basic. As you can see in &#8230;</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/jira-service-management-and-oem-24ai.html">Jira Service Management and OEM 24AI Test Fail</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>The integration of Oracle Enterprise Manager 24AI with Jira is explained clearly in the Oracle documentation. If you see an authentication error during the integration, it is most likely related to how you are performing the authentication.</p>



<p>The documentation is written for the Jira Cloud version, so it uses the Authorization: Basic method, where the Base64-encoded credential is built from email + API token, as shown in the document. If you are not using Jira Cloud (for example, the Data Center edition), authentication should instead be done with Authorization: Bearer, as the sketch below illustrates.</p>
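
<p>As a minimal illustration (the email address, tokens, and endpoints below are placeholders), the two header styles are built like this:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
# Jira Cloud (what the Oracle documentation assumes):
# Basic auth, Base64-encoding of email:api_token
curl -H &quot;Authorization: Basic $(echo -n 'user@example.com:your_api_token' | base64)&quot; \
  https://your-site.atlassian.net/rest/api/2/myself

# Jira Data Center: a personal access token is sent as-is
curl -H &quot;Authorization: Bearer your_personal_access_token&quot; \
  https://jira.example.com/rest/api/2/myself
</pre></div>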



<p>Another point: unless you have configured Jira differently, you need to log in with your username, not your email address, because Jira uses its local user directory by default.</p>



<p>Sample errors:</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
ExecuteThread: '22' for queue: 'weblogic.kernel.Default (self-tuning)'] ERROR framework.Method logp.269 - REST call returned an unsuccessful status: 401
ExecuteThread: '22' for queue: 'weblogic.kernel.Default (self-tuning)'] ERROR framework.Method logp.269 - REST call returned an unsuccessful status: 403
</pre></div>



<p>Below is an example of creating a ticket with curl using Authorization: Bearer. If you update the parameters to match your own configuration, you will see a sample ticket created in Jira.</p>





<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
curl -X POST \
  -H &quot;Content-Type: application/json&quot; \
  -H &quot;Authorization: Bearer your_token&quot; \
  -d '{
        &quot;fields&quot;: {
          &quot;project&quot;: { &quot;key&quot;: &quot;BGR&quot; },
          &quot;issuetype&quot;: { &quot;name&quot;: &quot;Task&quot; },
          &quot;summary&quot;: &quot;Test Ticket from OEM&quot;,
          &quot;priority&quot;: { &quot;name&quot;: &quot;Minor&quot; },
          &quot;components&quot;: &#x5B;{ &quot;name&quot;: &quot;Jira&quot; }],
          &quot;assignee&quot;: { &quot;name&quot;: &quot;your_token_username&quot; }
        }
      }' \
  https://jira.bugraparlayan.com.tr/rest/api/2/issue
</pre></div>


<p>After this, if you change the lines in the Jira templates from Authorization: Basic to Authorization: Bearer, you can easily integrate with Jira.</p>



<p>However, this is still not enough to create tickets automatically. You also need to change other lines in the template according to your Jira setup. These lines are explained inside the template.</p>



<figure class="wp-block-image size-large"><a href="https://www.bugraparlayan.com.tr/wp-content/uploads/2025/09/image.png"><img decoding="async" width="1024" height="387" src="https://www.bugraparlayan.com.tr/wp-content/uploads/2025/09/image-1024x387.png" alt="" class="wp-image-1528" srcset="https://www.bugraparlayan.com.tr/wp-content/uploads/2025/09/image-1024x387.png 1024w, https://www.bugraparlayan.com.tr/wp-content/uploads/2025/09/image-300x113.png 300w, https://www.bugraparlayan.com.tr/wp-content/uploads/2025/09/image-768x290.png 768w, https://www.bugraparlayan.com.tr/wp-content/uploads/2025/09/image-1536x581.png 1536w, https://www.bugraparlayan.com.tr/wp-content/uploads/2025/09/image-2048x774.png 2048w" sizes="(max-width: 1024px) 100vw, 1024px" /></a></figure>



<p>If you still have a problem, you can send me an email.</p>



<p>Ref:</p>



<p><a href="https://docs.oracle.com/en/enterprise-manager/cloud-control/enterprise-manager-cloud-control/24.1/jirac/installing-and-configuring-jira-connector.html#GUID-7C078A53-99FA-4303-B48B-F9EF0F0D6032" target="_blank" rel="noopener">https://docs.oracle.com/en/enterprise-manager/cloud-control/enterprise-manager-cloud-control/24.1/jirac/installing-and-configuring-jira-connector.html#GUID-7C078A53-99FA-4303-B48B-F9EF0F0D6032</a></p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/jira-service-management-and-oem-24ai.html">Jira Service Management and OEM 24AI Test Fail</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Multi-Factor Authentication (MFA) in Oracle Database RU19.28</title>
		<link>https://www.bugraparlayan.com.tr/multi-factor-authentication-mfa-in-oracle-database-ru19-28.html</link>
		
		<dc:creator><![CDATA[Bugra Parlayan]]></dc:creator>
		<pubDate>Sat, 30 Aug 2025 16:56:38 +0000</pubDate>
				<category><![CDATA[Oracle Database]]></category>
		<category><![CDATA[certificate-based authentication]]></category>
		<category><![CDATA[Cisco Duo]]></category>
		<category><![CDATA[MFA_OMA_ACTIVATION_EMAIL]]></category>
		<category><![CDATA[MFA_OMA_IAM_DOMAIN_URL]]></category>
		<category><![CDATA[MFA_SENDER_EMAIL_ID]]></category>
		<category><![CDATA[MFA_VERIFICATION_TYPE]]></category>
		<category><![CDATA[multi-factor authentication]]></category>
		<category><![CDATA[native database users]]></category>
		<category><![CDATA[oracle mfa]]></category>
		<category><![CDATA[Oracle Mobile Authenticator]]></category>
		<category><![CDATA[pluggable database (PDB)]]></category>
		<category><![CDATA[push notifications]]></category>
		<category><![CDATA[RADIUS authentication]]></category>
		<category><![CDATA[smart cards]]></category>
		<guid isPermaLink="false">https://www.bugraparlayan.com.tr/?p=1515</guid>

					<description><![CDATA[<p>Introduction: Fortifying the Core &#8211; The Imperative of MFA for Oracle Database In the contemporary digital ecosystem, the sophistication of cyber threats necessitates a security posture that extends beyond traditional perimeter defenses. Multi-factor authentication (MFA) has emerged as a foundational pillar of modern security architecture. It operates on the principle of requiring two or more &#8230;</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/multi-factor-authentication-mfa-in-oracle-database-ru19-28.html">Multi-Factor Authentication (MFA) in Oracle Database RU19.28</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Introduction: Fortifying the Core &#8211; The Imperative of MFA for Oracle Database</h2>



<p>In the contemporary digital ecosystem, the sophistication of cyber threats necessitates a security posture that extends beyond traditional perimeter defenses. Multi-factor authentication (MFA) has emerged as a foundational pillar of modern security architecture. It operates on the principle of requiring two or more distinct verification factors to grant access, thereby creating a layered defense that significantly diminishes the risk of unauthorized entry. For enterprise systems like the Oracle Database, which house an organization&#8217;s most critical data assets, implementing MFA is no longer an optional enhancement but a critical imperative for mitigating data breaches and ensuring operational integrity.</p>



<p>The drive towards stronger authentication is also heavily influenced by a stringent and evolving regulatory landscape. Global compliance frameworks and standards, including the Payment Card Industry Data Security Standard (PCI DSS), the European Union&#8217;s Digital Operational Resilience Act (DORA) and NIS2 Directive, and the National Institute of Standards and Technology (NIST) guidelines 800-63 and 800-53, increasingly mandate or strongly recommend robust authentication measures like MFA. Oracle&#8217;s native and integrated MFA capabilities provide a direct pathway for organizations to meet these demanding compliance requirements, safeguarding sensitive data and avoiding potential penalties.</p>



<p>This guide provides a comprehensive, expert-level walkthrough for implementing the primary MFA architectures supported by Oracle Database. These methods offer flexibility to align with diverse enterprise security strategies and infrastructures:</p>



<ul class="wp-block-list">
<li><strong>Push Notification-Based MFA:</strong> A modern, user-friendly approach integrating with mobile applications such as Oracle Mobile Authenticator (OMA) and Cisco Duo, allowing for real-time approval or denial of login attempts.  </li>



<li><strong>Certificate-Based MFA:</strong> A highly secure method leveraging Public Key Infrastructure (PKI), where users are authenticated using a combination of a password and a digital certificate stored on a physical device like a smart card.  </li>



<li><strong>RADIUS Integration:</strong> A versatile, standards-based protocol that enables the Oracle Database to connect with a wide array of third-party authentication managers, including token-based systems and centralized identity providers.  </li>
</ul>



<p>As Oracle continues to evolve its security offerings, a strategic shift towards simplifying MFA is on the horizon. Future releases are planned to include native MFA support for local database accounts, streamlining the process for administrators and broadening the scope of protection. This forward-looking development underscores Oracle&#8217;s commitment to embedding advanced security directly into the core of its database platform.</p>



<h2 class="wp-block-heading">Section 1: Foundational Concepts, Architecture, and Prerequisites</h2>



<p>Before embarking on the configuration of a specific MFA method, it is crucial to understand the underlying architecture, licensing landscape, and universal prerequisites. A successful implementation depends on a solid foundation of correctly configured components and a clear understanding of how they interact.</p>



<h3 class="wp-block-heading">1.1 Understanding the MFA Architecture: An Integration-Centric Model</h3>



<p>Oracle Database MFA is not a monolithic, self-contained feature. Instead, it functions as a sophisticated integration between the database and an external authentication authority. This architecture involves several key components working in concert: the Oracle Database, the Oracle Net listener, a server-side Oracle Wallet for storing secrets, and an external Identity Provider (IdP) or authentication server, such as Oracle Cloud Infrastructure (OCI) Identity and Access Management (IAM), Cisco Duo, or a RADIUS server.</p>



<p>This integration-centric model has significant operational implications. The database effectively acts as a client to these external identity services. This is evident in the database initialization and network configuration parameters, which point to external network endpoints via settings like <code>MFA_DUO_API_HOST</code>, <code>MFA_OMA_IAM_DOMAIN_URL</code>, and <code>SQLNET.RADIUS_AUTHENTICATION</code>. Consequently, a successful MFA transaction depends not only on a correctly configured database but also on a properly configured IdP, a functional email relay (SMTP) for enrollment notifications, and correctly defined network paths and firewall rules allowing communication between these systems. Troubleshooting MFA failures is therefore often a multi-disciplinary effort, requiring collaboration between database administrators, network engineers, security architects, and cloud identity administrators.</p>



<p>The Oracle Wallet plays a pivotal role in this architecture. It serves as the secure repository for the credentials the database uses to authenticate itself to the external IdP. These secrets can include API integration keys, client secrets, or service account passwords. Utilities such as <code>mkstore</code> and <code>orapki</code> are used to create and manage these wallets and the secrets stored within them, ensuring that sensitive authentication data is protected on the database server&#8217;s file system.</p>
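
<p>For orientation only (the wallet path and entry name below are illustrative placeholders, not official key names), working with these utilities looks roughly like this:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
# Create an auto-login wallet in the designated MFA wallet directory
orapki wallet create -wallet /u01/app/oracle/wallet/mfa -auto_login

# Store a secret the database will present to the external IdP
mkstore -wrl /u01/app/oracle/wallet/mfa -createEntry my.idp.client_secret example-secret

# List the entries the wallet currently holds
mkstore -wrl /u01/app/oracle/wallet/mfa -list
</pre></div>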



<h3 class="wp-block-heading">1.2 Licensing Considerations: Navigating the Ambiguity</h3>



<p>A frequent and critical question for any new feature implementation pertains to licensing. Oracle&#8217;s primary database licensing models, Named User Plus (NUP) and Processor, are well-documented for Standard Edition 2 and Enterprise Edition. However, the technical documentation detailing the configuration of MFA features for Oracle Database 19c does not explicitly state a requirement for any extra-cost options, such as the Oracle Advanced Security Option (ASO). This omission is notable, as Oracle is typically explicit about features that require additional licenses.</p>



<p>Further reinforcing this observation, official communications regarding upcoming native MFA capabilities also lack any mention of specific licensing prerequisites. This consistent absence of an explicit licensing dependency may represent a strategic decision by Oracle. By not tying this critical security feature to an additional license, Oracle lowers the barrier to adoption, making it easier for customers to secure their databases and meet the stringent compliance mandates that are increasingly prevalent in the industry. Enhancing the platform&#8217;s baseline security posture makes the Oracle Database a more resilient and attractive option for enterprises.</p>



<p>Despite this analysis, it is imperative to recognize that official license agreements are the ultimate source of truth. Organizations should always consult their specific Oracle Master Agreement (OMA), ordering documents, and the official Database Licensing Information User Manual to ensure full compliance before deploying any new features.</p>



<h3 class="wp-block-heading">1.3 Universal Prerequisites and Server Configuration</h3>



<p>Regardless of the chosen MFA method, several universal configuration steps must be performed on the database server to prepare the environment.</p>



<ul class="wp-block-list">
<li><strong>Network Timeouts:</strong> For push notification-based methods (OMA and Duo), the database connection must remain open while waiting for the user to respond to the prompt on their mobile device. The default timeout may be too short, leading to failed login attempts. It is essential to edit the <code>sqlnet.ora</code> file on the database server and set the <code>SQLNET.INBOUND_CONNECT_TIMEOUT</code> parameter to a value greater than 60 seconds to accommodate this delay.</li>



<li><strong>Wallet Creation and Location:</strong> The Oracle Wallet is required to store secrets. If a wallet does not already exist in the designated location, it must be created. The <code>orapki</code> utility can be used to create a new auto-login wallet (see the sketch after this list). The wallet location is specific to the container: for the CDB root, it is <code>&lt;WALLET_ROOT>/mfa</code>, and for Pluggable Databases (PDBs), it is <code>&lt;WALLET_ROOT>/&lt;PDB's GUID>/mfa</code>.</li>



<li><strong>SMTP Integration for Enrollment:</strong> Push-based MFA methods rely on email to send enrollment links to users. The database must be configured to communicate with an SMTP server. This involves setting several initialization parameters (<code>MFA_SMTP_HOST</code>, <code>MFA_SMTP_PORT</code>, <code>MFA_SENDER_EMAIL_ID</code>, <code>MFA_SENDER_EMAIL_DISPLAYNAME</code>). If the SMTP server requires authentication, the service account credentials must be securely stored in the Oracle Wallet.</li>



<li><strong>Password File Format:</strong> A critical prerequisite for enabling MFA on administrative accounts (users with <code>SYSDBA</code>, <code>SYSOPER</code>, etc., privileges) is the format of the password file. The password file must be in the 12.2 format or later. If the file is in an older format, any attempt to add an MFA factor to a privileged user will fail. The <code>orapwd</code> utility can be used to check the format and migrate the password file if necessary (see the sketch after this list).</li>
</ul>
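
<p>The following sketch pulls these prerequisites together; every path and file name is an example and should be adapted to your environment:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
# sqlnet.ora: allow time for the push-notification round trip (> 60 seconds)
#   SQLNET.INBOUND_CONNECT_TIMEOUT = 120

# Create the MFA wallet under WALLET_ROOT for the CDB root
orapki wallet create -wallet /u01/app/oracle/wallet/mfa -auto_login

# Check the password file format; MFA on administrative accounts needs 12.2 or later
orapwd describe file=$ORACLE_HOME/dbs/orapwORCL

# Migrate an old-format password file if necessary
orapwd file=$ORACLE_HOME/dbs/orapwORCL_new input_file=$ORACLE_HOME/dbs/orapwORCL format=12.2
</pre></div>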



<p>The following table provides a consolidated reference for the key parameters involved in configuring Oracle Database MFA.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Parameter Name</td><td>File Location</td><td>Applicable MFA Methods</td><td>Description</td></tr></thead><tbody><tr><td><code>$MFA_DUO_API_HOST</code></td><td><code>init.ora</code></td><td>Cisco Duo</td><td>Specifies the API hostname for the organization&#8217;s Cisco Duo account.</td></tr><tr><td><code>$MFA_OMA_IAM_DOMAIN_URL</code></td><td><code>init.ora</code></td><td>Oracle Mobile Authenticator (OMA)</td><td>Specifies the domain URL for the OCI IAM identity domain being used.</td></tr><tr><td><code>$MFA_SMTP_HOST</code></td><td><code>init.ora</code></td><td>OMA, Cisco Duo</td><td>The hostname or IP address of the SMTP server for sending enrollment emails.</td></tr><tr><td><code>$MFA_SMTP_PORT</code></td><td><code>init.ora</code></td><td>OMA, Cisco Duo</td><td>The port number for the SMTP server (e.g., 587 for TLS).</td></tr><tr><td><code>$MFA_SENDER_EMAIL_ID</code></td><td><code>init.ora</code></td><td>OMA, Cisco Duo</td><td>The email address that will appear as the sender of enrollment notifications.</td></tr><tr><td><code>$MFA_SENDER_EMAIL_DISPLAYNAME</code></td><td><code>init.ora</code></td><td>OMA, Cisco Duo</td><td>The display name for the sender of enrollment emails.</td></tr><tr><td><code>$SQLNET.INBOUND_CONNECT_TIMEOUT</code></td><td><code>sqlnet.ora</code></td><td>OMA, Cisco Duo</td><td>End-to-end authentication timeout. Must be &gt; 60 seconds for push notifications.</td></tr><tr><td><code>$SQLNET.AUTHENTICATION_SERVICES</code></td><td><code>sqlnet.ora</code></td><td>RADIUS, Certificate</td><td>Defines the authentication methods to be used (e.g., <code>(RADIUS)</code>, <code>(TCPS)</code>).</td></tr><tr><td><code>$SQLNET.RADIUS_AUTHENTICATION</code></td><td><code>sqlnet.ora</code></td><td>RADIUS</td><td>The hostname or IP address of the primary RADIUS authentication server.</td></tr><tr><td><code>$SQLNET.RADIUS_AUTHENTICATION_PORT</code></td><td><code>sqlnet.ora</code></td><td>RADIUS</td><td>The port number for the RADIUS server (default is 1812).</td></tr><tr><td><code>$SQLNET.RADIUS_SECRET</code></td><td><code>sqlnet.ora</code></td><td>RADIUS</td><td>The full path to the file containing the shared secret for the RADIUS client.</td></tr><tr><td><code>$SQLNET.RADIUS_CHALLENGE_RESPONSE</code></td><td><code>sqlnet.ora</code></td><td>RADIUS</td><td>Must be set to <code>ON</code> to enable challenge-response for OTPs and other factors.</td></tr><tr><td><code>$WALLET_LOCATION</code></td><td><code>sqlnet.ora</code>, <code>listener.ora</code></td><td>Certificate</td><td>Specifies the directory path of the Oracle Wallet for TLS/PKI credentials.</td></tr></tbody></table></figure>



<h2 class="wp-block-heading">Section 2: Implementing Push Notification-Based MFA</h2>



<p>Push notification MFA offers a balance of strong security and user convenience, making it a popular choice for modern applications. The Oracle Database integrates with two leading providers: Oracle Mobile Authenticator (OMA) and Cisco Duo.</p>



<h3 class="wp-block-heading">2.1 Configuration with Oracle Mobile Authenticator (OMA) and OCI IAM</h3>



<p>This method leverages an OCI IAM identity domain as the authentication authority and the OMA application for the second-factor verification.</p>



<ul class="wp-block-list">
<li><strong>Prerequisites:</strong> Before configuration, ensure you have an active OCI IAM account, the specific IAM domain URL, and have configured an application client within IAM that possesses both the <code>User Administrator</code> and <code>MFA Client</code> roles. The client ID and client secret for this application are required.  </li>



<li><strong>Step 1: Database Parameter Configuration:</strong> Set the necessary initialization parameters in the database. This includes the OMA-specific URL and the universal SMTP parameters for email notifications.
<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
ALTER SYSTEM SET MFA_OMA_IAM_DOMAIN_URL = "https://idcs-xxxxxxxx.identity.oraclecloud.com";
ALTER SYSTEM SET MFA_SMTP_HOST = "smtp.your-email-provider.com";
ALTER SYSTEM SET MFA_SMTP_PORT = 587;
ALTER SYSTEM SET MFA_SENDER_EMAIL_ID = "db-admin@example.com";
</pre></div></li>



<li><strong>Step 2: Securing Client Credentials in the Oracle Wallet:</strong> Use the <code>mkstore</code> utility to securely store the OCI IAM application&#8217;s client ID and secret in the server-side wallet. These credentials allow the database to authenticate to the IAM service.
<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
mkstore -wrl ./ -createEntry oracle.security.mfa.oma.clientid &lt;your_client_id>
mkstore -wrl ./ -createEntry oracle.security.mfa.oma.clientsecret &lt;your_client_secret>
</pre></div></li>



<li><strong>Step 3: OCI IAM Configuration:</strong> A critical administrative step within OCI IAM is to adjust the validity period for the enrollment token, controlled by the <code>jwtValidityDurationInSecs</code> setting. By default, the enrollment link sent to users expires in 300 seconds (5 minutes), which is often impractical. It is highly recommended to extend this period to 86,400 seconds (24 hours) via the IAM API to prevent user enrollment failures.
<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
curl -X PATCH "${DOMAIN_URL}/admin/v1/AuthenticationFactorSettings/AuthenticationFactorSettings" \
  -H 'Content-Type: application/json' -H "Authorization: Bearer $AUTH_TOKEN" \
  -d '{"schemas":["urn:ietf:params:scim:api:messages:2.0:PatchOp"],"Operations":[{"op":"replace","path":"jwtValidityDurationInSecs","value":86400}]}'
</pre></div></li>



<li><strong>Step 4: User Enrollment and Management:</strong> With the backend configured, administrators can now enable MFA for database users using SQL commands.
<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- For a new user
CREATE USER jsmith IDENTIFIED BY "ComplexPassword1" AND FACTOR 'OMA_PUSH' AS 'jason.smith@example.com';

-- For an existing user
ALTER USER jsmith ADD FACTOR 'OMA_PUSH' AS 'jason.smith@example.com';
</pre></div>
Executing these commands does more than modify the database user account; it triggers an identity provisioning event. The database communicates with OCI IAM to create a corresponding user account in the identity domain and dispatches an enrollment email. This tight integration creates a &#8220;dual identity&#8221; model and introduces an important lifecycle management consideration. While the database automates user creation in the IdP, the process for de-provisioning or handling changes requires a clearly defined administrative procedure to prevent orphaned accounts in the IAM system. The end-user receives an email with a QR code, which they scan with the OMA app to register their device and complete the enrollment process.</li>
</ul>



<h3 class="wp-block-heading">2.2 Configuration with Cisco Duo</h3>



<p>This method integrates the database with the Cisco Duo security platform for second-factor authentication.</p>



<ul class="wp-block-list">
<li><strong>Prerequisites:</strong> An active Cisco Duo account is required. From the Duo administrative console, you will need to obtain three key pieces of information for the integration: the API hostname, the Integration key, and the Secret key.  </li>



<li><strong>Step 1: Database Parameter Configuration:</strong> Set the Duo-specific API host parameter along with the standard SMTP parameters in the database&#8217;s initialization file.
<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
ALTER SYSTEM SET MFA_DUO_API_HOST = "api-xxxxxxxx.duosecurity.com";
ALTER SYSTEM SET MFA_SMTP_HOST = "smtp.your-email-provider.com";
ALTER SYSTEM SET MFA_SMTP_PORT = 587;
</pre></div></li>



<li><strong>Step 2: Securing Duo Credentials in the Oracle Wallet:</strong> Use the <code>mkstore</code> utility to store the Duo Integration key and Secret key in the server&#8217;s Oracle Wallet. These keys are used by the database to securely communicate with the Duo API.
<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
mkstore -wrl ./ -createEntry oracle.security.mfa.duo.integrationkey &lt;your_integration_key>
mkstore -wrl ./ -createEntry oracle.security.mfa.duo.secretkey &lt;your_secret_key>
</pre></div></li>



<li><strong>Step 3: User Enrollment and Management:</strong> Administrators can enable Duo MFA for users via SQL. The process is analogous to the OMA configuration.
<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- For a new user
CREATE USER jsmith IDENTIFIED BY "ComplexPassword1" AND FACTOR 'DUO_PUSH' AS 'jason.smith@example.com';

-- For an existing user
ALTER USER jsmith ADD FACTOR 'DUO_PUSH' AS 'jason.smith@example.com';
</pre></div>
Similar to the OMA integration, this command initiates the creation of the user within the Cisco Duo account if they do not already exist. An enrollment email is sent to the user, containing a link that expires after 24 hours. The same &#8220;dual identity&#8221; lifecycle management considerations apply here as with OMA. The user follows the link to register their device with the Cisco Duo mobile application.</li>
</ul>



<h2 class="wp-block-heading">Section 3: Implementing Certificate-Based MFA (PKI)</h2>



<p>Certificate-based authentication represents a highly secure, albeit more complex, form of MFA. It combines something the user knows (a password) with something the user has (a private key securely stored with a digital certificate), often on a hardware token or smart card.</p>



<h3 class="wp-block-heading">3.1 Principles of Certificate-Based Authentication</h3>



<p>This method is built on Public Key Infrastructure (PKI). It establishes a mutual Transport Layer Security (mTLS) connection, where both the client and the database server present digital certificates to prove their identities to one another. The database server is configured to trust certificates issued by a specific Certificate Authority (CA). When a user connects, they present their personal certificate. The server validates the certificate against its trusted CAs and then maps the certificate&#8217;s subject Distinguished Name (DN)—for example, <code>cn=jason.smith</code>—to a specific database user account to authorize access.</p>



<h3 class="wp-block-heading">3.2 Server-Side Configuration</h3>



<ul class="wp-block-list">
<li><strong>Step 1: Configure the Listener for TCPS:</strong> The Oracle Net Listener must be configured to accept secure connections. This involves editing the <code>listener.ora</code> file to add a TCPS protocol endpoint on a dedicated port (e.g., 1522) and specifying the <code>WALLET_LOCATION</code> that holds the server&#8217;s certificate and private key (see the sketch after this list).</li>



<li><strong>Step 2: Create and Populate the Server Wallet:</strong> Using the <code>orapki</code> command-line utility, an administrator creates a server wallet, generates a Certificate Signing Request (CSR), and submits it to a trusted internal or external CA. Once the signed server certificate is received, it is imported back into the wallet along with the CA&#8217;s root and any intermediate certificates to complete the trust chain.  </li>



<li><strong>Step 3: Configure <code>sqlnet.ora</code> on the Server:</strong> The server&#8217;s <code>sqlnet.ora</code> file must be updated to specify the <code>WALLET_LOCATION</code> and to include <code>TCPS</code> in the <code>SQLNET.AUTHENTICATION_SERVICES</code> parameter list, enabling it to process secure connections.</li>
</ul>
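
<p>A rough server-side sketch follows; the wallet path, hostname, port, and DN are all placeholders:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
# 1. Create the server wallet and generate a certificate signing request
orapki wallet create -wallet /u01/app/oracle/wallet/server -auto_login
orapki wallet add -wallet /u01/app/oracle/wallet/server -dn &quot;CN=db01.example.com&quot; -keysize 2048
orapki wallet export -wallet /u01/app/oracle/wallet/server -dn &quot;CN=db01.example.com&quot; -request /tmp/db01.csr

# 2. After the CA signs the request, import the trust chain and the server certificate
orapki wallet add -wallet /u01/app/oracle/wallet/server -trusted_cert -cert /tmp/ca_root.crt
orapki wallet add -wallet /u01/app/oracle/wallet/server -user_cert -cert /tmp/db01.crt

# 3. listener.ora / sqlnet.ora additions, shown here as comments:
#   (ADDRESS = (PROTOCOL = TCPS)(HOST = db01.example.com)(PORT = 1522))
#   WALLET_LOCATION = (SOURCE = (METHOD = FILE)(METHOD_DATA = (DIRECTORY = /u01/app/oracle/wallet/server)))
#   SQLNET.AUTHENTICATION_SERVICES = (TCPS)
</pre></div>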



<h3 class="wp-block-heading">3.3 Client-Side Configuration</h3>



<ul class="wp-block-list">
<li><strong>Step 1: Configure <code>sqlnet.ora</code> on the Client:</strong> The client machine&#8217;s <code>sqlnet.ora</code> file must also be configured. It needs a <code>WALLET_LOCATION</code> entry pointing to the directory containing the user&#8217;s personal wallet (with their certificate and private key) and must have <code>SQLNET.AUTHENTICATION_SERVICES=(TCPS)</code>.</li>



<li><strong>Step 2: Configure <code>tnsnames.ora</code> on the Client:</strong> A new TNS entry must be created in the client&#8217;s <code>tnsnames.ora</code> file. This entry will specify <code>PROTOCOL=TCPS</code> and the port number configured for the secure listener on the server (see the sketch after this list).</li>
</ul>
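
<p>For illustration (the alias, host, and service names are invented), the client-side pieces might look like this:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
# sqlnet.ora on the client:
#   WALLET_LOCATION = (SOURCE = (METHOD = FILE)(METHOD_DATA = (DIRECTORY = /home/jsmith/wallet)))
#   SQLNET.AUTHENTICATION_SERVICES = (TCPS)

# tnsnames.ora on the client:
#   ORCL_SECURE =
#     (DESCRIPTION =
#       (ADDRESS = (PROTOCOL = TCPS)(HOST = db01.example.com)(PORT = 1522))
#       (CONNECT_DATA = (SERVICE_NAME = orclpdb1)))

# Connect through the secure alias; the client presents the wallet certificate
sqlplus jsmith@ORCL_SECURE
</pre></div>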



<h3 class="wp-block-heading">3.4 User Enrollment for Certificate MFA</h3>



<p>Once the client and server infrastructure is in place, administrators can associate a user&#8217;s certificate with their database account.</p>



<ul class="wp-block-list">
<li>The <code>CREATE USER</code> and <code>ALTER USER</code> commands are used with the <code>FACTOR 'CERT_AUTH'</code> clause. The <code>AS</code> clause must contain the subject DN from the user&#8217;s certificate that will be used for mapping.
<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- For a new user
CREATE USER jsmith IDENTIFIED BY "ComplexPassword1" AND FACTOR 'CERT_AUTH' AS 'cn=jason.smith';

-- For an existing user
ALTER USER jsmith ADD FACTOR 'CERT_AUTH' AS 'cn=jason.smith';
</pre></div></li>
</ul>



<p>When the user connects using the secure TNS alias, they will be prompted for their password. The Oracle client will then present their certificate from the wallet. The database validates both the password and the certificate, completing the multi-factor authentication process.</p>



<h2 class="wp-block-heading">Section 4: Implementing RADIUS-Based MFA</h2>



<p>Remote Authentication Dial-In User Service (RADIUS) is a widely adopted networking protocol that provides centralized Authentication, Authorization, and Accounting (AAA) management. Integrating Oracle Database with a RADIUS server allows it to leverage existing enterprise two-factor authentication solutions, such as those based on hardware tokens, OTPs, or other third-party systems.</p>



<h3 class="wp-block-heading">4.1 RADIUS Authentication Flow</h3>



<p>The authentication process involves a coordinated exchange between multiple components. First, the user initiates a connection to the Oracle Database using their credentials. The database, acting as a RADIUS client, forwards the authentication request to the configured RADIUS server. The RADIUS server then processes the request, often by validating the first factor against a directory service like LDAP or Active Directory. If the first factor is valid, the RADIUS server issues a challenge for the second factor (e.g., &#8220;Enter your OTP&#8221;). The database relays this challenge to the user&#8217;s client. The user provides the second factor (e.g., a six-digit code from their authenticator app), which is sent back through the database to the RADIUS server for final validation. Upon successful validation, the RADIUS server sends an &#8220;Access-Accept&#8221; message, and the database grants the connection.</p>



<h3 class="wp-block-heading">4.2 Database Configuration for RADIUS</h3>



<ul class="wp-block-list">
<li><strong>Step 1: Register the Database as a RADIUS Client:</strong> In the RADIUS server&#8217;s administrative interface, the Oracle Database server must be registered as a new RADIUS client. This process typically involves providing the database server&#8217;s IP address and defining a <code>sharedSecret</code>. This secret is a pre-shared key used to encrypt communication between the database and the RADIUS server.  </li>



<li><strong>Step 2: Configure <code>sqlnet.ora</code>:</strong> The server-side <code>sqlnet.ora</code> file is the primary location for configuring the RADIUS integration. Several parameters are required (see the sketch after this list):
<ul class="wp-block-list">
<li><code>SQLNET.AUTHENTICATION_SERVICES=(RADIUS)</code>: This tells Oracle Net to use the RADIUS authentication adapter.</li>

<li><code>SQLNET.RADIUS_AUTHENTICATION</code>: The hostname or IP address of the RADIUS server.</li>

<li><code>SQLNET.RADIUS_AUTHENTICATION_PORT</code>: The UDP port for RADIUS authentication, typically 1812.</li>

<li><code>SQLNET.RADIUS_SECRET</code>: The full file system path to the <code>radius.key</code> file containing the shared secret.</li>

<li><code>SQLNET.RADIUS_CHALLENGE_RESPONSE = ON</code>: This parameter is essential for enabling the interactive challenge-response mechanism required for OTPs.</li>
</ul>
</li>



<li><strong>Step 3: Secure the Shared Secret:</strong> The shared secret obtained from the RADIUS server must be stored in a file on the database server. This file (e.g., <code>$ORACLE_HOME/network/security/radius.key</code>) should contain only the secret string. For security, its file system permissions must be restricted so that it is readable only by the Oracle software owner.  </li>
</ul>
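
<p>Putting the pieces together (the server name, path, and secret below are examples only):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
# sqlnet.ora on the database server:
#   SQLNET.AUTHENTICATION_SERVICES = (RADIUS)
#   SQLNET.RADIUS_AUTHENTICATION = radius01.example.com
#   SQLNET.RADIUS_AUTHENTICATION_PORT = 1812
#   SQLNET.RADIUS_SECRET = /u01/app/oracle/product/19.0.0/dbhome_1/network/security/radius.key
#   SQLNET.RADIUS_CHALLENGE_RESPONSE = ON

# Store the shared secret and restrict it to the Oracle software owner
echo 'example-shared-secret' > $ORACLE_HOME/network/security/radius.key
chmod 600 $ORACLE_HOME/network/security/radius.key
</pre></div>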



<h3 class="wp-block-heading">4.3 User Configuration and Login Experience</h3>



<p>Since authentication is fully delegated to the external RADIUS infrastructure, database users must be created in a specific way.</p>



<ul class="wp-block-list">
<li>Users are created using the <code>IDENTIFIED EXTERNALLY</code> clause. This instructs the Oracle Database not to manage the user&#8217;s password but to rely on the external authentication service (RADIUS) to verify the user&#8217;s identity. The database username must match the username in the directory that the RADIUS server authenticates against.
<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
CREATE USER user1 IDENTIFIED EXTERNALLY;
GRANT CREATE SESSION TO user1;
</pre></div></li>



<li>When a user connects via a tool like SQL*Plus, they will first be prompted for their password (the first factor). If successful, an &#8220;Oracle &#8211; Challenge&#8221; window will appear, prompting for the second factor, such as an OTP. After entering the correct OTP, the login succeeds, and the user is granted a session in the database.  </li>
</ul>



<h2 class="wp-block-heading">Section 5: Administration, Auditing, and Troubleshooting</h2>



<p>Effective long-term management of MFA requires established procedures for user administration, a robust auditing strategy, and a clear understanding of how to troubleshoot common issues.</p>



<h3 class="wp-block-heading">5.1 Ongoing User Administration</h3>



<p>Day-to-day management of MFA for users is handled through standard SQL commands. The <code>ALTER USER</code> statement can be used to add, modify, or remove MFA factors. As noted previously, administrators must be mindful of the &#8220;dual identity&#8221; lifecycle. When removing an MFA factor or dropping a database user, a corresponding action may be required in the external IdP (OCI IAM or Duo) to de-provision the user and prevent orphaned accounts. While it is possible to configure MFA for database links, this is not a typical use case and should be approached with caution, as it could complicate automated scripts and processes.</p>



<h3 class="wp-block-heading">5.2 Auditing MFA Events</h3>



<p>Oracle&#8217;s unified audit trail provides comprehensive capabilities for monitoring security-related events, including MFA. When a user successfully authenticates, a standard login audit record is generated. More importantly, when a second-factor authentication attempt fails, the details of the failure are captured in the <code>additional_info</code> column of the unified audit trail view (<code>UNIFIED_AUDIT_TRAIL</code>). This allows security administrators to monitor for suspicious activity, such as repeated MFA failures for a specific user, which could indicate a compromised password or an attack in progress.</p>
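
<p>A quick monitoring sketch against the unified audit trail (the one-day window and the <code>LOGON</code> filter are illustrative choices):</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
# Review recent failed logins and inspect the MFA details column
sqlplus -s / as sysdba &lt;&lt;'EOF'
SET LINESIZE 200
SELECT event_timestamp, dbusername, return_code, additional_info
FROM   unified_audit_trail
WHERE  action_name = 'LOGON'
AND    return_code != 0
AND    event_timestamp > SYSTIMESTAMP - INTERVAL '1' DAY
ORDER  BY event_timestamp;
EOF
</pre></div>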



<h3 class="wp-block-heading">5.3 Common Issues and Troubleshooting</h3>



<p>A structured approach to troubleshooting is essential for resolving MFA-related login problems efficiently.</p>



<ul class="wp-block-list">
<li><strong>Problem: User Enrollment Fails or Emails Are Not Received.</strong>
<ul class="wp-block-list">
<li><strong>Causes &amp; Solutions:</strong> This issue typically points to a problem in the email delivery chain. Administrators should first verify the <code>MFA_SMTP_*</code> parameters in the <code>init.ora</code> file. If the SMTP server requires authentication, confirm that the correct credentials are stored in the Oracle Wallet. Network connectivity issues, such as firewalls blocking the SMTP port between the database server and the email relay, are also common culprits (see the connectivity sketch after this list). For OMA, it is crucial to verify that the enrollment link has not expired due to the <code>jwtValidityDurationInSecs</code> setting in OCI IAM.</li>
</ul>
</li>



<li><strong>Problem: Connection Fails with Timeout Error.</strong>
<ul class="wp-block-list">
<li><strong>Causes &amp; Solutions:</strong> The most likely cause is the <code>SQLNET.INBOUND_CONNECT_TIMEOUT</code> parameter in <code>sqlnet.ora</code> being set too low for push notifications. This value must be greater than 60 seconds. If the parameter is set correctly, investigate potential network latency or firewall issues between the database server and the external IdP (Duo or OCI IAM) that could be delaying the API calls.</li>
</ul>
</li>



<li><strong>Problem: User Cannot Log In / Account is Locked.</strong>
<ul class="wp-block-list">
<li><strong>Causes &amp; Solutions:</strong> This can result from multiple issues, including an incorrect password, an invalid OTP, or too many failed MFA attempts leading to an account lockout in the external IdP. A critical operational scenario is when a user loses or replaces their mobile device. Administrators must have a defined process for resetting a user&#8217;s MFA enrollment in the IdP. Proactively generating and securely distributing one-time bypass codes can provide users with an emergency access method while their device is being restored.  </li>
</ul>
</li>



<li><strong>Problem: Administrator Cannot Enable MFA for a User.</strong>
<ul class="wp-block-list">
<li><strong>Causes &amp; Solutions:</strong> While some Oracle cloud services require users to self-enroll in MFA, enabling MFA for the Oracle Database is an administrative action performed via <code>ALTER USER</code>. If this command fails, check for other prerequisites, such as ensuring the password file is in the 12.2 format for privileged users. Confusion can arise from different workflows across the Oracle ecosystem, but for the database, the DBA is in control of enrollment.  </li>
</ul>
</li>



<li><strong>Problem: RADIUS Authentication Fails.</strong>
<ul class="wp-block-list">
<li><strong>Causes &amp; Solutions:</strong> RADIUS failures often stem from configuration mismatches. Verify that the shared secret in the <code>radius.key</code> file exactly matches the one configured on the RADIUS server. Double-check the RADIUS server&#8217;s IP address or hostname and port in the <code>sqlnet.ora</code> file. Confirm that no firewalls are blocking the UDP traffic on the RADIUS port between the database server and the RADIUS server. Finally, ensure the user exists and is active in the backend directory (e.g., Active Directory) that the RADIUS server is configured to use for authentication.  </li>
</ul>
</li>
</ul>
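
<p>When enrollment emails go missing, a quick probe from the database server helps separate network problems from configuration problems; the host and port below are examples:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
# Is the SMTP relay reachable at all?
nc -vz smtp.your-email-provider.com 587

# Does it answer and offer STARTTLS?
openssl s_client -starttls smtp -connect smtp.your-email-provider.com:587 -crlf
</pre></div>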



<p>The following table lists common error codes related to MFA and provides actionable guidance for database administrators.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Error Code</td><td>Error Message</td><td>DBA Action/Resolution</td></tr></thead><tbody><tr><td><code>AUTH-3001</code></td><td>You entered an incorrect username or password.</td><td>The first authentication factor (password) failed. This is a standard password issue, not an MFA-specific problem. Advise the user to verify their password.</td></tr><tr><td><code>AUTH-3002</code></td><td>Your account is locked. Contact your system administrator.</td><td>The user&#8217;s account is locked, likely in the external IdP (OCI IAM, Duo) or backend directory (AD/LDAP) due to too many failed login attempts. The administrator must unlock the account in the respective system.</td></tr><tr><td><code>AUTH-3024</code></td><td>You aren&#8217;t authorized to access the app. Contact your system administrator.</td><td>The user&#8217;s credentials are valid, but they are not authorized by a policy in the external IdP to access the database application. Check sign-on policies in OCI IAM or Duo.</td></tr><tr><td><code>AUTH-3029</code></td><td>MFA is enabled for the user. The user must provide a second factor of authentication in addition to password authentication.</td><td>This is an informational code indicating the password was correct and the database is now waiting for the second factor. If login fails after this, troubleshoot the second-factor path (push notification, OTP, etc.).</td></tr><tr><td><code>AUTH-1002</code></td><td>The {0} authentication factor is not supported or enabled.</td><td>The MFA method the user is attempting to use has not been enabled in the IdP&#8217;s policy. The administrator needs to enable the required factor (e.g., Push, SMS) in the OCI IAM or Duo configuration.</td></tr><tr><td><code>AUTH-1007</code></td><td>Authentication failed.</td><td>This is a generic second-factor failure. It could be an incorrect OTP, a denied push notification, or a timeout. Check the IdP logs for specific details.</td></tr><tr><td><code>AUTH-1016</code></td><td>You can&#8217;t skip enrollment. 2-Step Verification is required.</td><td>A policy in the IdP requires the user to enroll in MFA, but they are attempting to bypass it. The user must complete the enrollment process.</td></tr></tbody></table></figure>






<h2 class="wp-block-heading">Conclusion: A Multi-Layered Security Posture for the Future</h2>



<p>The implementation of Multi-Factor Authentication is a transformative step in securing an Oracle Database environment. By moving beyond the limitations of single-factor password authentication, organizations can significantly mitigate the risks associated with compromised credentials, which remain a primary vector for data breaches. Each of the methods detailed in this guide—push notification, certificate-based, and RADIUS—provides a robust mechanism for enforcing stronger identity verification, helping organizations align with a zero-trust security philosophy and meet stringent regulatory compliance mandates.</p>



<p>The future of Oracle Database security points towards even tighter and more seamless integration of these critical controls. The planned introduction of native MFA support for local database accounts, anticipated in the July 2025 Database Release Update (DBRU), marks a significant evolution. This development will simplify the architecture by reducing the dependency on external identity providers for basic database accounts, allowing administrators to directly secure even legacy or application-specific schemas. This strategic direction demonstrates Oracle&#8217;s ongoing commitment to embedding advanced security features into the core database engine, making them more accessible and manageable for all customers.</p>



<p>Ultimately, adopting one of the MFA strategies outlined here is not merely a technical exercise; it is a fundamental step in building a resilient, multi-layered security posture. In an era of persistent and sophisticated threats, fortifying access to critical data platforms is an essential responsibility for every data custodian.</p>



<p><a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/newft/new-features-19c-release-updates.html#GUID-307E2F26-E7B8-4FCC-927E-6CA35D608E84" target="_blank" rel="noopener">https://docs.oracle.com/en/database/oracle/oracle-database/19/newft/new-features-19c-release-updates.html#GUID-307E2F26-E7B8-4FCC-927E-6CA35D608E84</a></p>



<p><a href="https://docs.oracle.com/en/database/oracle/oracle-database/19/dbseg/part_1.html" target="_blank" rel="noopener">https://docs.oracle.com/en/database/oracle/oracle-database/19/dbseg/part_1.html</a></p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/multi-factor-authentication-mfa-in-oracle-database-ru19-28.html">Multi-Factor Authentication (MFA) in Oracle Database RU19.28</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Creating an ASM Disk on Exadata FlashDisk</title>
		<link>https://www.bugraparlayan.com.tr/creating-an-asm-disk-on-exadata-flashdisk.html</link>
		
		<dc:creator><![CDATA[Bugra Parlayan]]></dc:creator>
		<pubDate>Sat, 19 Jul 2025 19:06:23 +0000</pubDate>
				<category><![CDATA[Engineered Systems]]></category>
		<category><![CDATA[ASM]]></category>
		<category><![CDATA[direct path read]]></category>
		<category><![CDATA[disk drop]]></category>
		<category><![CDATA[exadata]]></category>
		<category><![CDATA[flash cache]]></category>
		<category><![CDATA[FlashDisk]]></category>
		<category><![CDATA[performance tuning]]></category>
		<category><![CDATA[planned maintenance]]></category>
		<category><![CDATA[Temp Tablespace]]></category>
		<category><![CDATA[warm-up effect]]></category>
		<guid isPermaLink="false">https://www.bugraparlayan.com.tr/?p=1505</guid>

					<description><![CDATA[<p>Recently, we noticed performance drops at certain times on an Exadata server operating as a data warehouse (DWH). Upon analyzing AWR reports, the following wait event stood out: direct path read/write temp waits We knew this event was caused by temporary (temp) data being written to or read from disk during query execution. Although we &#8230;</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/creating-an-asm-disk-on-exadata-flashdisk.html">Creating an ASM Disk on Exadata FlashDisk</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<p>Recently, we noticed performance drops at certain times on an Exadata server operating as a data warehouse (DWH). Upon analyzing <strong>AWR reports</strong>, the following wait event stood out:</p>



<blockquote class="wp-block-quote is-layout-flow wp-block-quote-is-layout-flow">
<p>direct path read/write temp waits</p>
</blockquote>



<p>We knew this event was caused by temporary (temp) data being written to or read from disk during query execution. Although we had applied various SQL optimizations, they did not significantly reduce the wait times.</p>



<h3 class="wp-block-heading"><strong>Root Cause: Heavy TEMP Tablespace Usage</strong></h3>



<p>Large SQL queries in data warehouses—especially those involving operations like hash joins, sorts, and group by—require extensive use of TEMP space. In our case, the <strong>TEMP tablespace</strong> was located on <em>high-latency</em> disk groups by default, which increased IO wait times.</p>



<h3 class="wp-block-heading"><strong>Solution: Move TEMP Tablespace to FlashDisk on Exadata</strong></h3>



<p>To directly improve performance, we decided to move the TEMP tablespace to the <strong>FlashDisk</strong> layer on Exadata. This layer offers much higher IO throughput compared to traditional disks, making it particularly advantageous for handling temporary data operations.</p>



<p>P.S. Since we will drop and recreate the existing flash cache during this operation, there may be minor negative impacts during the process and until the flash cache warms up again. Therefore, this work should be performed during a planned, quiet maintenance window.</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: bash; title: ; notranslate">
dcli -g cell_group -l root cellcli -e &quot;alter flashcache all flush&quot;
dcli -g cell_group -l root cellcli -e &quot;LIST CELLDISK ATTRIBUTES name, flushstatus, flusherror&quot; | grep FD
dcli -g cell_group -l root cellcli -e &quot;drop flashcache&quot;
dcli -g cell_group -l root cellcli -e &quot;create flashcache all size=20.2875976562500T&quot;
dcli -g cell_group -l root cellcli -e &quot;CREATE GRIDDISK ALL FLASHDISK PREFIX='FLASHTMP'&quot;

</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
sqlplus / as sysasm

alter system set asm_diskstring='o/*/DATA_*','o/*/RECO_*','o/*/FLASHTMP*';

CREATE diskgroup TEMPDG normal redundancy disk 'o/*/FLASHTMP*' attribute 'compatible.rdbms'='19.0.0.0.0', 'compatible.asm'='19.0.0.0.0', 'cell.smart_scan_capable'='TRUE', 'au_size'='4M';
</pre></div>

<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
sqlplus / as sysdba

CREATE TEMPORARY TABLESPACE TEMP_FLASH TEMPFILE '+TEMPDG' SIZE 32G EXTENT MANAGEMENT LOCAL UNIFORM SIZE 1M;

ALTER DATABASE DEFAULT TEMPORARY TABLESPACE TEMP_FLASH;
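
-- Optional sanity check before dropping the old tablespace
-- (hypothetical verification, not part of the original procedure):
SELECT PROPERTY_VALUE FROM DATABASE_PROPERTIES WHERE PROPERTY_NAME = 'DEFAULT_TEMP_TABLESPACE';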

DROP TABLESPACE TEMP INCLUDING CONTENTS AND DATAFILES;

</pre></div><p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/creating-an-asm-disk-on-exadata-flashdisk.html">Creating an ASM Disk on Exadata FlashDisk</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Oracle ORA-4031 Error: Analysis and Solutions</title>
		<link>https://www.bugraparlayan.com.tr/oracle-ora-4031-error-analysis-and-solutions.html</link>
		
		<dc:creator><![CDATA[Bugra Parlayan]]></dc:creator>
		<pubDate>Sun, 13 Jul 2025 15:23:14 +0000</pubDate>
				<category><![CDATA[Oracle Database]]></category>
		<category><![CDATA[cursor sharing]]></category>
		<category><![CDATA[fragmentation]]></category>
		<category><![CDATA[hard parse]]></category>
		<category><![CDATA[large pool]]></category>
		<category><![CDATA[library cache]]></category>
		<category><![CDATA[memory allocation]]></category>
		<category><![CDATA[ORA-04031]]></category>
		<category><![CDATA[SGA]]></category>
		<category><![CDATA[shared pool]]></category>
		<guid isPermaLink="false">https://www.bugraparlayan.com.tr/?p=1496</guid>

					<description><![CDATA[<p>Introduction: What the ORA-4031 Error Means and Why It&#8217;s Important In Oracle database management, some errors are signs of deeper, system-wide problems, not just simple mistakes. The ORA-04031: unable to allocate bytes of shared memory error is one of the most well-known and critical examples. This error message means that Oracle cannot find a large &#8230;</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/oracle-ora-4031-error-analysis-and-solutions.html">Oracle ORA-4031 Error: Analysis and Solutions</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">Introduction: What the ORA-4031 Error Means and Why It&#8217;s Important</h2>



<p>In Oracle database management, some errors are signs of deeper, system-wide problems, not just simple mistakes. The <code>ORA-04031: unable to allocate bytes of shared memory</code> error is one of the most well-known and critical examples. This error message means that Oracle cannot find a large enough piece of memory in its shared memory area, known as the System Global Area (SGA). It might look like a simple &#8220;out of memory&#8221; problem, but <code>ORA-04031</code> often points to bigger issues in database configuration, application design, or memory management.</p>



<p>When this error happens, it can cause slow database performance, application errors, or even stop users from connecting to the system. For a Database Administrator (DBA), it&#8217;s very important not just to fix the <code>ORA-04031</code> error for the moment, but to understand its root cause to prevent it from happening again. The error is usually a &#8220;symptom,&#8221; while the real &#8220;illness&#8221; could be memory fragmentation, bad application code, or insufficient resource planning.</p>



<p>This detailed report is designed to explain all parts of the <code>ORA-04031</code> error. It will cover everything from the basic anatomy of the error, its place in Oracle&#8217;s memory structure, modern and traditional ways to diagnose it, the use of advanced tools like SQL Trace and TKPROF, quick and long-term solutions, and finally, how to prevent it. This document aims to be the ultimate guide for Oracle professionals when they face the <code>ORA-04031</code> error, helping them move from reactive problem-solving to proactive system management.</p>



<h2 class="wp-block-heading">Chapter 1: The Anatomy of the ORA-4031 Error: What It Is and Why It Happens</h2>



<p>To solve the <code>ORA-04031</code> error effectively, you first need to understand the message itself and the reasons behind it. This error is a result, and following the chain of causes that lead to it is the first step to a correct diagnosis.</p>



<h3 class="wp-block-heading">1.1. Decoding the Error Message: &#8220;unable to allocate bytes of shared memory&#8221;</h3>



<p>The <code>ORA-04031</code> error message from Oracle gives valuable clues about the source of the problem. A typical error message looks like this: <code>ORA-04031: unable to allocate 4160 bytes of shared memory ("shared pool","unknown object","sga heap(1,0)","modification ")</code>.</p>



<p>When we break down this message, we get this key information:</p>



<ul class="wp-block-list">
<li><strong>Requested Memory Size:</strong> The <code>bytes</code> value in the message (for example, <code>4160 bytes</code>) shows the amount of memory that Oracle could not find at that moment. Whether this size is large or small gives the first clue about the problem. If even a small memory request fails, it usually points to memory fragmentation.  </li>



<li><strong>Memory Pool:</strong> This is the most important piece of information, showing which part of the SGA the error happened in. The most common values are <code>"shared pool"</code>, <code>"large pool"</code>, or <code>"java pool"</code>. This information helps the DBA focus directly on the right memory area. For example, if the error is in the   <code>"large pool"</code>, you should focus on the <code>LARGE_POOL_SIZE</code> parameter, not <code>SHARED_POOL_SIZE</code>.  </li>



<li><strong>Memory Structure Details:</strong> The extra arguments in quotes (like <code>"unknown object"</code>, <code>"sga heap(1,0)"</code>, <code>"kglseshtTable"</code>) give more technical details about what kind of memory structure Oracle was trying to allocate. This information can be valuable for advanced analysis or when working with Oracle Support.  </li>
</ul>



<p>This error message is like a report from a crime scene. It doesn&#8217;t name the culprit directly, but it gives important evidence about where and how the crime happened.</p>



<h3 class="wp-block-heading">1.2. The Victim and the Culprit: Why the Process Getting the Error is Usually Innocent</h3>



<p>A common mistake when analyzing an <code>ORA-04031</code> error is to think that the process or SQL query that received the error is the problem. However, Oracle experts and documentation make a clear distinction: the process that gets the error is usually the &#8220;victim,&#8221; not the &#8220;culprit.&#8221;</p>



<p>Understanding this changes the entire diagnosis process. A user might get an <code>ORA-04031</code> error while running a simple <code>SELECT</code> query or an application calling a small PL/SQL block. The memory request from this process is usually very small. The problem is not the small request itself, but the fact that at that moment, there was no single piece of free space big enough in the shared pool.</p>



<p>The real culprit is usually other processes or the general behavior of the application that used up or fragmented the memory over time. For example, if hundreds of different users send similar but textually different SQL queries without using bind variables, the shared pool gets filled with thousands of small, non-reusable SQL plans. This slowly fragments the memory. Eventually, when an innocent process makes a small memory request, the <code>ORA-04031</code> error is triggered because there is no single piece of free space large enough.</p>



<p>Therefore, a DBA&#8217;s first reaction should be, &#8220;What happened to the shared pool that it can&#8217;t even handle this small request?&#8221; instead of &#8220;What&#8217;s wrong with this query?&#8221; This perspective shifts the investigation from a single query to the memory usage habits of the entire system, greatly increasing the chance of finding the true root cause.</p>



<h3 class="wp-block-heading">1.3. The Main Causes: Not Enough Memory, Fragmentation, and Application Errors</h3>



<p>The main reasons behind the <code>ORA-04031</code> error can be grouped into a few main categories:</p>



<ul class="wp-block-list">
<li><strong>Insufficient Sizing:</strong> This is the simplest and most obvious reason. The memory pool sizes set in the database startup parameters (<code>init.ora</code> or <code>spfile</code>), like <code>SHARED_POOL_SIZE</code>, <code>LARGE_POOL_SIZE</code>, or <code>JAVA_POOL_SIZE</code>, might be too low for the current workload. If you are using Automatic Memory Management (AMM/ASMM), the overall   <code>SGA_TARGET</code> or <code>MEMORY_TARGET</code> parameters being too small can also cause this.</li>



<li><strong>Memory Fragmentation:</strong> This is the most common and hardest cause to diagnose. In this scenario, there might be enough total free memory in the shared pool, but it&#8217;s broken into small, scattered pieces. When Oracle requests a single 4 KB piece of memory, even if there&#8217;s 100 MB of total free space, the request will fail if the largest single free piece is only 2 KB. The number one cause of fragmentation is applications sending literal SQL queries without using bind variables. (A quick way to gauge the system-wide hard parse rate is shown after this list.)</li>



<li><strong>Application Design Flaws:</strong> This category is the root of fragmentation.
<ul class="wp-block-list">
<li><strong>Using Literal SQL:</strong> Queries like <code>SELECT * FROM employees WHERE id = 101;</code> and <code>SELECT * FROM employees WHERE id = 102;</code> are seen by Oracle as two completely different queries, and each gets its own space in the shared pool. If a structure with bind variables like <code>SELECT * FROM employees WHERE id = :p_id;</code> was used instead, the query would be &#8220;parsed&#8221; once, and the same plan would be reused for thousands of different <code>id</code> values. This is the key to using the shared pool efficiently.  </li>



<li><strong>Too Much Dynamic SQL:</strong> Applications that constantly generate different SQL texts cause a similar effect.</li>
</ul>
</li>



<li><strong>Oracle Bugs or Memory Leaks:</strong> Although less common, there might be a software bug in the Oracle version or a specific feature you are using. This bug could cause memory not to be released properly after use (a memory leak), which can lead to the shared pool running out of space over time. This should be considered, especially if the error happens repeatedly after certain operations (like an RMAN backup or DDL operations). That&#8217;s why checking for known bugs on My Oracle Support (MOS) is a standard step when investigating the problem.  </li>
</ul>
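<p>One quick sanity check ties several of these causes together: the system-wide parse statistics. The query below is a minimal sketch against the standard <code>V$SYSSTAT</code> view; if the hard parse count grows nearly as fast as the total parse count, literal SQL is the usual suspect.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Hard parses vs. total parses since instance startup.
SELECT name, value
FROM   v$sysstat
WHERE  name IN ('parse count (total)', 'parse count (hard)');
</pre></div>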



<h2 class="wp-block-heading">Chapter 2: The Heart of the Error: Oracle Memory Architecture and SGA Components</h2>



<p>To fully understand the <code>ORA-04031</code> error, you need to know about Oracle&#8217;s memory architecture, especially the System Global Area (SGA) and its components. This error is a memory allocation problem that happens in the shared memory pools within the SGA. Therefore, knowing what these pools do and how they work is essential for diagnosis and solution.</p>



<h3 class="wp-block-heading">2.1. System Global Area (SGA): The Database&#8217;s Shared Memory Area</h3>



<p>The System Global Area (SGA) is a group of shared memory structures that contain data and control information for an Oracle database instance. The SGA is used by all server and background processes together. The SGA is allocated when a database instance starts and is released when it shuts down. This area aims to minimize disk I/O operations by keeping critical data that directly affects database performance in memory.</p>



<p>The SGA is made up of several main components (a quick query for inspecting their current sizes follows the list):</p>



<ul class="wp-block-list">
<li><strong>Database Buffer Cache:</strong> Stores copies of data blocks read from data files.</li>



<li><strong>Redo Log Buffer:</strong> Temporarily holds &#8220;redo&#8221; entries that record changes made to the database.</li>



<li><strong>Shared Pool:</strong> Caches program data and shared SQL areas. This is where the <code>ORA-04031</code> error happens most often.</li>



<li><strong>Large Pool:</strong> An optional area used for large memory allocations.</li>



<li><strong>Java Pool:</strong> Used to store Java code and data within the JVM.</li>



<li><strong>Fixed SGA:</strong> An internal area that contains general information about the state of the database.</li>
</ul>
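<p>Before diving into individual pools, it helps to see how the SGA is currently carved up. The query below is a minimal sketch against the <code>V$SGAINFO</code> view (available in Oracle 11g and later); only standard dictionary views are assumed.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Current size of each SGA component, largest first (values in bytes).
SELECT name, bytes, resizeable
FROM   v$sgainfo
ORDER  BY bytes DESC;
</pre></div>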



<h3 class="wp-block-heading">2.2. Shared Pool: Where the Error Happens Most Often</h3>



<p>The Shared Pool is one of the most complex and dynamic components of the SGA. Almost every operation in the database touches this area. When a SQL query is run, Oracle accesses this pool. That&#8217;s why the vast majority of <code>ORA-04031</code> errors occur in this pool. The main sub-components of the Shared Pool are:</p>



<ul class="wp-block-list">
<li><strong>Library Cache:</strong> This is where executable SQL and PL/SQL code is cached. This is the main battleground for <code>ORA-04031</code>. When a SQL query is run for the first time, Oracle parses it, creates the best execution plan, and stores this plan in the Library Cache. This is called a <strong>&#8220;hard parse&#8221;</strong> and is expensive in terms of CPU and memory. If the same SQL query is run again later, Oracle reuses the existing plan from the Library Cache. This is called a <strong>&#8220;soft parse&#8221;</strong> and is much more efficient. The use of literal SQL, one of the main causes of   <code>ORA-04031</code>, makes every query unique, eliminating the possibility of a &#8220;soft parse&#8221; and causing constant &#8220;hard parses,&#8221; which fills up the Library Cache.</li>



<li><strong>Data Dictionary Cache:</strong> Holds metadata about database objects (tables, indexes, users, etc.). It&#8217;s also known as the row cache. Oracle frequently accesses this cache during SQL parsing.  </li>



<li><strong>Reserved Pool:</strong> This is a special area within the Shared Pool, set aside for large memory allocations. Oracle usually tries to satisfy memory requests larger than 5 KB from this reserved pool. The purpose of this mechanism is to prevent large memory requests from fragmenting the main part of the Shared Pool. A DBA can control the size of this area with the <code>SHARED_POOL_RESERVED_SIZE</code> startup parameter. When analyzing an <code>ORA-04031</code> error, it&#8217;s important to know if the error occurred in the Reserved Pool or the general area of the Shared Pool. (A query for checking this configuration follows the list.)</li>
</ul>
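<p>A minimal sketch for checking how the shared pool and its reserved area are currently configured; it uses only the standard <code>V$PARAMETER</code> view.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Configured sizes; a value of 0 usually means the pool is auto-managed.
SELECT name, display_value
FROM   v$parameter
WHERE  name IN ('shared_pool_size', 'shared_pool_reserved_size');
</pre></div>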



<h3 class="wp-block-heading">2.3. Large Pool and Java Pool: Other Potential Error Spots</h3>



<p>The <code>ORA-04031</code> error message clearly states that the problem is not always in the Shared Pool; it can sometimes be in other SGA components like the Large Pool or Java Pool.</p>



<ul class="wp-block-list">
<li><strong>Large Pool:</strong> This is an optional memory area, and its purpose is to hold large memory allocations that are not suitable for the Shared Pool. The most important difference between the Large Pool and the Shared Pool is that it does not have an LRU (Least Recently Used) mechanism. Memory allocated here stays until it is released by the session that requested it. This prevents fragmentation for large objects. The main scenarios where the Large Pool is used are:
<ul class="wp-block-list">
<li><strong>Recovery Manager (RMAN):</strong> Uses large amounts of memory for I/O buffers during backup and restore operations.</li>



<li><strong>Parallel Query:</strong> Memory buffers for messaging between parallel processes are kept here.</li>



<li><strong>Shared Server:</strong> Memory allocations for the User Global Area (UGA) are made from the Large Pool. If the <code>ORA-04031</code> error message contains the phrase <code>"large pool"</code>, the solution is to directly increase the <code>LARGE_POOL_SIZE</code> parameter.  </li>
</ul>
</li>



<li><strong>Java Pool:</strong> This pool is used to store all session-specific Java code and data inside the Oracle Java Virtual Machine (JVM). Even if a DBA does not use Java code directly, Oracle may use Java for some internal operations. So, this pool is always present. If the error message specifies <code>"java pool"</code>, the problem is caused by an insufficient <code>JAVA_POOL_SIZE</code> parameter, and its value should be increased. (A quick look at current per-pool allocations follows this list.)</li>
</ul>
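<p>To see where SGA memory actually sits right now, a simple aggregation over <code>V$SGASTAT</code> can be used. This is a minimal sketch with no assumptions beyond the standard view.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Total allocated megabytes per SGA pool
-- (rows with a NULL pool belong to fixed areas and the buffer cache).
SELECT pool, ROUND(SUM(bytes)/1024/1024) AS size_mb
FROM   v$sgastat
GROUP  BY pool
ORDER  BY size_mb DESC;
</pre></div>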



<h2 class="wp-block-heading">Chapter 3: Diagnosis and Analysis: Finding the Source of the Problem</h2>



<p>When you encounter an <code>ORA-04031</code> error, it&#8217;s key to follow a systematic diagnosis process without panicking to correctly identify the source of the problem. This process includes both modern automatic tools provided by Oracle and manual analysis methods based on a DBA&#8217;s experience.</p>



<h3 class="wp-block-heading">3.1. First Steps: Checking the Alert Log and Trace Files</h3>



<p>As with any critical database issue, the first place to look is the database&#8217;s <code>alert.log</code> file. Oracle records important events, errors, and structural changes in this file.</p>



<ul class="wp-block-list">
<li>If the <code>ORA-04031</code> error was triggered by a background process (like <code>DM00</code>, <code>SMON</code>), the error and related details are written to the <code>alert.log</code>. This log entry usually also includes the location of a trace file (<code>.trc</code>) created at the time of the error. This trace file provides invaluable information about the process state and a memory dump at the time of the error.  </li>



<li>However, if the error was received by a user process, there may be no record in the <code>alert.log</code>. This once again confirms the &#8220;victim and culprit&#8221; paradigm. The error is reported to the end-user or application but may not leave a trace in the database logs. Therefore, not finding anything in the <code>alert.log</code> doesn&#8217;t mean there&#8217;s no problem; it just means the diagnosis needs to shift in a different direction. (A quick way to locate the alert log and trace directory is shown after this list.)</li>
</ul>
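<p>Locating the <code>alert.log</code> and trace files no longer requires guessing file system paths. On 11g and later, the <code>V$DIAG_INFO</code> view reports the ADR locations directly; the query below is a minimal sketch.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Where this instance writes its alert log and trace files.
SELECT name, value
FROM   v$diag_info
WHERE  name IN ('Diag Trace', 'Diag Alert', 'Default Trace File');
</pre></div>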



<h3 class="wp-block-heading">3.2. Oracle&#8217;s Modern Solution: Autonomous Health Framework (AHF) and MOS Troubleshooting Tool</h3>



<p>Oracle offers powerful tools to simplify and automate the diagnosis of complex errors like <code>ORA-04031</code>. This modern approach saves time and minimizes human error, especially in emergencies.</p>



<ul class="wp-block-list">
<li><strong>Autonomous Health Framework (AHF):</strong> AHF is a set of tools that autonomously monitors database system health and proactively detects problems. For the <code>ORA-04031</code> error, AHF can create a special diagnostic package called a Service Request Data Collection (SRDC). In the past, a DBA had to manually collect many different pieces of data like the <code>alert.log</code>, trace files, and AWR reports for Oracle Support. AHF automates this process with a single command. Running the following command as the Oracle user on the server where the error occurred is enough: <code>$ tfactl diagcollect -srdc ora4031</code>. AHF will ask for the time of the error and the database name, then collect all relevant diagnostic data, including from the operating system, database, and cluster software, and package it into a <code>.zip</code> file.</li>



<li><strong>My Oracle Support (MOS) Troubleshooting Tool:</strong> The diagnostic package created by AHF can be uploaded to the <code>ORA-04031</code> troubleshooting tool on the MOS portal. This web-based tool analyzes the uploaded data and compares signatures from logs and memory dumps against known errors and issues in Oracle&#8217;s vast knowledge base. As a result of the analysis, the tool usually provides a specific patch, a configuration change recommendation, or a document related to a known bug that will solve the problem. If the tool cannot find a solution, a new Service Request (SR) can be created for Oracle Support with a single click from the same interface, and all the collected diagnostic data is automatically attached to the request.</li>
</ul>



<h3 class="wp-block-heading">3.3. In-Depth Analysis with V$ Dynamic Performance Views</h3>



<p>While automatic tools are powerful, an experienced DBA uses Oracle&#8217;s dynamic performance views (V$ views) to understand and analyze the problem themselves. These views provide real-time information about the database&#8217;s memory structures.</p>



<ul class="wp-block-list">
<li><code>V$SGASTAT</code>: This view shows the overall memory usage of the SGA and the distribution within the pools. A simple query like <code>SELECT POOL, NAME, BYTES FROM V$SGASTAT ORDER BY BYTES DESC;</code> can quickly show which pool and which type of memory allocations within the pool are taking up the most space. The size of the   <code>"free memory"</code> row, in particular, gives an idea of the total free space in the pool.</li>



<li><code>V$SHARED_POOL_RESERVED</code>: This is the most critical view for understanding whether the cause of the <code>ORA-04031</code> error is <strong>fragmentation or insufficient space</strong>. The columns in this view provide detailed information about the health of the reserved pool.  </li>



<li><code>V$LIBRARYCACHE</code>: This view measures the effectiveness of the Library Cache. A high value in the <code>RELOADS</code> column indicates that SQL or PL/SQL objects are being kicked out of the cache and reloaded too frequently. This is often a sign that the shared pool is too small for the workload.  </li>



<li><code>V$SHARED_POOL_ADVICE</code>: This advisor view provides an estimate of how the library cache hit ratio and parse times would be affected if the <code>SHARED_POOL_SIZE</code> parameter were set to different values. This offers a data-driven approach when deciding to resize the shared pool. (Example queries against these views follow the list.)</li>
</ul>
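<p>The two queries below sketch how these views are typically used in practice: the first ranks Library Cache namespaces by reload activity, and the second reads the shared pool advisor. Both rely only on the standard columns of these views.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- High RELOADS relative to PINS suggests objects are being aged out too often.
SELECT namespace, gets, pins, reloads, invalidations
FROM   v$librarycache
ORDER  BY reloads DESC;

-- Estimated parse time saved at different shared pool sizes (size is in MB).
SELECT shared_pool_size_for_estimate AS size_mb,
       shared_pool_size_factor       AS size_factor,
       estd_lc_time_saved            AS estd_time_saved_s
FROM   v$shared_pool_advice
ORDER  BY shared_pool_size_for_estimate;
</pre></div>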



<p>The table below summarizes the most important columns in the <code>V$SHARED_POOL_RESERVED</code> view and their meaning in diagnosing <code>ORA-04031</code>.</p>



<h4 class="wp-block-heading">Table 1: Meanings and Interpretation of <code>V$SHARED_POOL_RESERVED</code> Columns</h4>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Column</td><td>Meaning and Interpretation</td></tr></thead><tbody><tr><td><code>REQUEST_FAILURES</code></td><td>The total number of times a memory request could not be met and an <code>ORA-04031</code> error was generated. A value greater than zero indicates an active memory allocation problem in the system.</td></tr><tr><td><code>LAST_FAILURE_SIZE</code></td><td>The size in bytes of the last failed memory request. This value is key to understanding the root of the problem. Whether this size is larger or smaller than the <code>_shared_pool_reserved_min_alloc</code> parameter determines if the problem is fragmentation or a general lack of space.</td></tr><tr><td><code>REQUEST_MISSES</code></td><td>The number of requests that caused an object to be flushed from the LRU (Least Recently Used) list because no space was found in the free list. A high value indicates heavy pressure on the pool and that Oracle is constantly kicking out objects to make space.</td></tr><tr><td><code>FREE_SPACE</code></td><td>The total amount of free space in the Reserved Pool in bytes. A high value here might suggest that the problem is not a lack of total free space, but likely fragmentation.</td></tr><tr><td><code>AVG_FREE_SIZE</code></td><td>The average size of a free memory chunk in the Reserved Pool. If this value is significantly smaller than <code>LAST_FAILURE_SIZE</code>, it is strong evidence that the memory is broken into small pieces and fragmentation is high.</td></tr></tbody></table></figure>



<p>Using this data, a critical distinction can be made about the nature of the problem. If <code>REQUEST_FAILURES</code> is greater than zero and <code>LAST_FAILURE_SIZE</code> is small (for example, smaller than the default value of the <code>_shared_pool_reserved_min_alloc</code> parameter, which is about 4400 bytes), this indicates that the system cannot allocate even a small block of memory. This is the clearest sign of severe memory fragmentation. The solution is not to increase the memory size, but to fix the application behavior (usually the use of literal SQL) that is causing the fragmentation.</p>



<p>Conversely, if <code>LAST_FAILURE_SIZE</code> is large, this indicates that the application legitimately needs large blocks of memory, but the space allocated by <code>SHARED_POOL_RESERVED_SIZE</code> is insufficient to meet these demands. In this scenario, the solution is to increase the <code>SHARED_POOL_RESERVED_SIZE</code> and therefore the <code>SHARED_POOL_SIZE</code> parameters. This simple but powerful distinction allows the DBA to focus their efforts in the right direction.</p>
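<p>This distinction can be checked with a single query. The sketch below pulls exactly the Table 1 columns needed for the fragmentation-versus-sizing decision.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- If REQUEST_FAILURES > 0 and LAST_FAILURE_SIZE is small (&lt; ~4400 bytes),
-- suspect fragmentation; if LAST_FAILURE_SIZE is large, suspect an
-- undersized SHARED_POOL_RESERVED_SIZE.
SELECT request_failures, last_failure_size, avg_free_size, free_space
FROM   v$shared_pool_reserved;
</pre></div>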



<h2 class="wp-block-heading">Chapter 4: Root Cause Analysis: Finding Performance Bottlenecks with SQL Trace and TKPROF</h2>



<p>While <code>V$</code> views tell us &#8220;what&#8221; and &#8220;where&#8221; the problem is (an <code>ORA-04031</code> error is happening in the shared pool due to fragmentation), tools like SQL Trace and TKPROF help us answer the &#8220;why&#8221; question. These tools are used to identify the destructive application behaviors, especially the SQL queries that use the shared pool inefficiently, leading to the <code>ORA-04031</code> error.</p>



<h3 class="wp-block-heading">4.1. What are SQL Trace and TKPROF? Their Role in ORA-04031 Analysis</h3>



<ul class="wp-block-list">
<li><strong>SQL Trace:</strong> A mechanism in Oracle that allows collecting detailed performance statistics for SQL queries run in a specific session or across the entire system. When enabled, it writes data for each SQL query&#8217;s parse, execute, and fetch phases—such as CPU time, elapsed time, disk reads, and logical reads—to a <code>.trc</code> (trace) file.  </li>



<li><strong>TKPROF (Transient Kernel Profiler):</strong> A command-line utility that turns the raw and hard-to-read <code>.trc</code> files produced by SQL Trace into a summarized and formatted report that is easy for humans to understand.  </li>
</ul>



<p>Although <code>ORA-04031</code> is not directly a performance problem (like a slow query), it is often a side effect of habits that lead to poor performance. TKPROF reports are invaluable for revealing the biggest culprit of shared pool fragmentation: SQL queries that are &#8220;hard parsed&#8221; frequently and not reused.</p>



<h3 class="wp-block-heading">4.2. Step-by-Step Guide to Activating SQL Trace and Creating a TKPROF Report</h3>



<p>The following steps can be followed to trace a problematic application session:</p>



<ol start="1" class="wp-block-list">
<li><strong>Enable Tracing:</strong> In modern Oracle versions (10g and later), it is recommended to use the <code>DBMS_MONITOR</code> package for this. This provides more flexible and centralized control. After finding the SID and SERIAL# of the problematic session from <code>V$SESSION</code>, tracing can be started with this command: <code>EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => :sid, serial_num => :serial, waits => TRUE, binds => TRUE);</code> Setting the <code>waits</code> and <code>binds</code> parameters to <code>TRUE</code> collects additional information about wait events and bind variable values, allowing for a richer analysis. Alternatively, the <code>ALTER SESSION SET SQL_TRACE = TRUE;</code> command can be used from within the session itself.</li>



<li><strong>Run the Application:</strong> After tracing is enabled, the user or application is expected to perform the actions that cause the error. During this time, all SQL activities are recorded in the trace file.</li>



<li><strong>Disable Tracing:</strong> It is very important to turn off tracing after the process is complete; otherwise, it will unnecessarily consume disk space and create a performance load: <code>EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => :sid, serial_num => :serial);</code></li>



<li><strong>Create a TKPROF Report:</strong> The trace file is located in the <code>USER_DUMP_DEST</code> (or <code>DIAGNOSTIC_DEST</code> in 11g and later) directory on the database server. A report is generated by running the <code>tkprof</code> command on this file: <code>tkprof &lt;trace_file_name.trc&gt; &lt;report_file_name.txt&gt; explain=user/pass sort=(prscnt,exeela,fchela) sys=no</code>. The <code>sort</code> parameter makes the report sort the SQL queries by specific criteria (e.g., parse count <code>prscnt</code>, execute time <code>exeela</code>), bringing the most problematic queries to the top. (A consolidated sketch of the whole sequence follows this list.)</li>
</ol>
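<p>Putting the steps together, the following is a consolidated sketch of the whole sequence as run from SQL*Plus. The username filter, the SID/SERIAL# values, and the trace file name are placeholders to be replaced with your own.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- 1) Identify the target session (the username is a placeholder).
SELECT sid, serial# FROM v$session WHERE username = 'APP_USER';

-- 2) Enable tracing with wait and bind information (placeholder values).
EXEC DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 123, serial_num => 4567, waits => TRUE, binds => TRUE);

-- 3) Let the application reproduce the problematic workload...

-- 4) Disable tracing.
EXEC DBMS_MONITOR.SESSION_TRACE_DISABLE(session_id => 123, serial_num => 4567);

-- 5) Then, at the OS level, format the resulting trace file:
--    tkprof ORCL_ora_12345.trc report.txt sort=(prscnt,exeela,fchela) sys=no
</pre></div>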



<h3 class="wp-block-heading">4.3. Interpreting the TKPROF Report: Finding High &#8220;Parse&#8221; Counts and Resource-Hungry SQL</h3>



<p>The TKPROF report dedicates a section to each SQL query in the traced session. The most critical part for <code>ORA-04031</code> analysis is the statistics table provided for each query:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse        1      0.00       0.00          0          0          0           0
Execute   1000      0.10       0.09          0        120          5        1000
Fetch     1000      0.05       0.04          0       2000          0        1000
------- ------  -------- ---------- ---------- ---------- ----------  ----------
total     2001      0.15       0.13          0       2120          5        2000

</pre></div>


<p>The red flag that points to the root cause of <code>ORA-04031</code> in this table is the relationship between the numbers in the <code>Parse</code> and <code>Execute</code> columns. In a well-designed application that uses bind variables, a SQL query is parsed once (<code>Parse count = 1</code>) and executed thousands of times (<code>Execute count = 1000s</code>). This shows that the shared pool is being used efficiently.</p>



<p>However, if a table like the one below is seen in the TKPROF report, it indicates a serious problem:</p>


<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
call     count       cpu    elapsed       disk      query    current        rows
------- ------  -------- ---------- ---------- ---------- ----------  ----------
Parse     1000      0.80       0.95          0          0          0           0
Execute   1000      0.10       0.09          0        120          5        1000
...

</pre></div>


<p>Here, the fact that the <strong><code>Parse count</code> (1000) is equal to the <code>Execute count</code> (1000)</strong> is definitive proof that this query is being &#8220;hard parsed&#8221; again every time it is executed. This shows that the application is not using bind variables and is sending a textually new query to Oracle each time. This behavior quickly fills up the shared pool, causes fragmentation, and eventually sets the stage for an <code>ORA-04031</code> error. Queries that show this pattern in a TKPROF report are the number one targets for correction.</p>



<h3 class="wp-block-heading">4.4. Identifying Non-Shared SQL: <code>V$SQLAREA</code> and the Literal SQL Problem</h3>



<p>It&#8217;s not always possible or practical to get a trace file. Fortunately, the <code>V$SQLAREA</code> view, which reflects the current state of the Library Cache in the shared pool, allows for the same analysis in real-time. This view contains one row for each unique SQL query in the shared pool.</p>



<p>An application using literal SQL fills <code>V$SQLAREA</code> with queries like these:</p>



<ul class="wp-block-list">
<li><code>SELECT * FROM T WHERE C = 'A'</code></li>



<li><code>SELECT * FROM T WHERE C = 'B'</code></li>



<li><code>SELECT * FROM T WHERE C = 'C'</code></li>
</ul>



<p>These three queries take up space as different rows with different <code>SQL_ID</code>s in <code>V$SQLAREA</code>. To group such similar queries, Oracle creates a signature called <code>FORCE_MATCHING_SIGNATURE</code>. This signature represents the structure of the query, ignoring the literal values. This way, queries with the same structure but different literals can be identified.</p>



<p>The table below provides practical SQL queries that can be used to identify non-shareable (literal) SQL queries and the problems they cause, using the <code>V$SQLAREA</code> view.</p>



<h4 class="wp-block-heading">Table 2: Queries to Detect Literal SQLs Using <code>V$SQLAREA</code></h4>



<figure class="wp-block-table"><table class="has-fixed-layout"><thead><tr><td>Purpose</td><td>SQL Query</td><td>Interpretation</td></tr></thead><tbody><tr><td><strong>Find Queries That Are Parsed Frequently</strong></td><td><code>SELECT executions, parsing_schema_name, sql_text FROM v$sqlarea WHERE executions &lt; 5 AND parsing_schema_name NOT IN ('SYS', 'SYSTEM') ORDER BY sql_text;</code></td><td>This query lists queries that have been run very few times (<code>&lt; 5</code>), meaning they are not being reused. Seeing many very similar queries in the result list (only differing in the values in the <code>WHERE</code> clause) is a sign of literal SQL usage.</td></tr><tr><td><strong>Count How Many Different Versions of the Same Query Exist</strong></td><td><code>SELECT COUNT(*), force_matching_signature FROM v$sqlarea GROUP BY force_matching_signature HAVING COUNT(*) &gt; 20 ORDER BY COUNT(*) DESC;</code></td><td>This query uses <code>force_matching_signature</code> to group structurally identical queries and counts how many different versions are in each group. A high number (e.g., <code>&gt; 20</code>) indicates that the SQL structure is being run constantly with different literals, polluting the shared pool. This directly points to one of the most common causes of <code>ORA-04031</code>.</td></tr></tbody></table></figure>



<p>The results of these queries show the DBA which application or schema is polluting the shared pool the most and provide a clear roadmap on where to focus correction efforts.</p>
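<p>Once the Table 2 queries flag a suspicious signature, drilling into one group shows the concrete literal variants behind it. The sketch below assumes <code>:sig</code> is bound to a <code>FORCE_MATCHING_SIGNATURE</code> value taken from the previous result.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Sample of the textual variants sharing one signature.
SELECT sql_id, executions, sql_text
FROM   v$sqlarea
WHERE  force_matching_signature = :sig
AND    ROWNUM &lt;= 10;
</pre></div>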



<h2 class="wp-block-heading">Chapter 5: Solution Strategies: Quick Fixes and Permanent Solutions</h2>



<p>When dealing with an <code>ORA-04031</code> error, solutions should be considered in a hierarchy. Some solutions are temporary &#8220;band-aids&#8221; that quickly get the system running again, others treat the symptoms of the problem, and the most effective ones are permanent &#8220;cures&#8221; that eliminate the root cause of the disease. A DBA&#8217;s job is to choose the right solution for the situation and aim for permanent solutions in the long run.</p>



<h3 class="wp-block-heading">5.1. Temporary Solutions: Getting the System Up and Running Quickly (Band-Aids)</h3>



<p>These solutions are used to immediately stop the <code>ORA-04031</code> errors that are interrupting the system&#8217;s operation. However, since they do not solve the underlying problem, the error is likely to reappear shortly after.</p>



<ul class="wp-block-list">
<li><code>ALTER SYSTEM FLUSH SHARED_POOL;</code>: This command instantly clears all reloadable objects (SQL plans, PL/SQL packages, etc.) from the Shared Pool. This temporarily frees up space by merging fragmented memory, allowing new memory requests to be met.
<ul class="wp-block-list">
<li><strong>Warnings:</strong> This command should be used with care. Flushing the shared pool causes all SQL queries run after that moment to be &#8220;hard parsed&#8221; again. This can lead to a sudden spike in CPU usage and a drop in performance right after the command is run. Also, if the system is already suffering from a severe memory shortage, this command itself might not even work and could cause errors like   <code>ORA-01012: not logged on</code>. This is a temporary fix that only buys time until the next database restart, rather than solving the problem.  </li>
</ul>
</li>



<li><strong>Restarting the DB:</strong> This is the most definitive but most disruptive temporary solution. Restarting the database completely clears and rebuilds the entire SGA (and therefore the Shared Pool). This definitely gets rid of the   <code>ORA-04031</code> problem, but it causes the database to be out of service for a while. Needing to restart frequently is a clear sign that there is a serious underlying problem and that a reactive management approach is being used.</li>
</ul>



<h3 class="wp-block-heading">5.2. Permanent Configuration Solutions: Optimizing Memory Management (Symptom Treatment)</h3>



<p>These solutions are aimed at treating the symptoms of the error. Even if they don&#8217;t solve the underlying application problem, they can help the database cope better with the issue, delaying or preventing the error from occurring.</p>



<ul class="wp-block-list">
<li><strong>Increasing Pool Sizes:</strong> The most common first reaction to an <code>ORA-04031</code> error is to increase the size of the relevant memory pool. Depending on the pool mentioned in the error message, the following commands can be used:
<ul class="wp-block-list">
<li><code>ALTER SYSTEM SET SHARED_POOL_SIZE = '500M' SCOPE=BOTH;</code></li>



<li><code>ALTER SYSTEM SET LARGE_POOL_SIZE = '100M' SCOPE=BOTH;</code></li>



<li><code>ALTER SYSTEM SET SHARED_POOL_RESERVED_SIZE = '50M' SCOPE=BOTH;</code> This approach can be effective, especially in systems with legitimately high memory needs or mild fragmentation. However, if the main problem is severe fragmentation, increasing the pool size is just like &#8220;getting a bigger trash can&#8221;; the problem isn&#8217;t solved, its appearance is just delayed.  </li>
</ul>
</li>



<li><strong>Automatic Memory Management (AMM/ASMM):</strong> This is one of the most recommended approaches for modern Oracle databases. Instead of manually setting parameters like <code>SHARED_POOL_SIZE</code> and <code>DB_CACHE_SIZE</code>, Oracle is allowed to manage these pools dynamically (a configuration sketch follows this list).
<ul class="wp-block-list">
<li><strong>ASMM (Automatic Shared Memory Management):</strong> Enabled by setting the <code>SGA_TARGET</code> parameter. Oracle dynamically shifts memory between SGA components (Shared Pool, Buffer Cache, etc.) based on the workload. For example, when pressure on the Shared Pool increases, it can take memory from the Buffer Cache to grow the Shared Pool.</li>



<li><strong>AMM (Automatic Memory Management):</strong> Enabled by setting the <code>MEMORY_TARGET</code> parameter. This goes a step beyond ASMM by managing both SGA and PGA (Program Global Area) memory from a single pool. Automatic management significantly reduces the risk of <code>ORA-04031</code> because Oracle can proactively expand the relevant pool as soon as it detects memory pressure.</li>
</ul>
</li>
</ul>
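<p>As referenced above, enabling ASMM is usually a two-parameter change. The sketch below assumes an spfile is in use and that the sizes shown are placeholders for your own sizing; note that under ASMM an explicit <code>SHARED_POOL_SIZE</code> acts as a minimum floor rather than a fixed size.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Enable ASMM: Oracle redistributes memory between SGA components as needed
-- (sga_target must not exceed sga_max_size).
ALTER SYSTEM SET sga_target = 8G SCOPE=BOTH;

-- Optional: guarantee a minimum shared pool while ASMM manages the rest.
ALTER SYSTEM SET shared_pool_size = 1G SCOPE=BOTH;
</pre></div>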



<h3 class="wp-block-heading">5.3. Root Cause Solutions: Improvements at the Application and Code Level (The Cure)</h3>



<p>These strategies solve the problem at its core and are the most important steps for long-term database health.</p>



<ul class="wp-block-list">
<li><strong>The Power of Bind Variables:</strong> The ultimate and most correct solution for <code>ORA-04031</code> errors caused by fragmentation is to ensure that applications use bind variables (a PL/SQL sketch of the same contrast follows this list).<ul><li><strong>Bad Practice (Literal SQL), in Java:</strong> <code>// This code causes a new hard parse in every loop. for (int i=0; i&lt;1000; i++) { statement.executeQuery("SELECT * FROM products WHERE product_id = " + i); }</code></li><li><strong>Good Practice (Bind Variable), in Java:</strong> <code>// This code is parsed once, executed 1000 times. PreparedStatement pstmt = connection.prepareStatement("SELECT * FROM products WHERE product_id = ?"); for (int i=0; i&lt;1000; i++) { pstmt.setInt(1, i); ResultSet rs = pstmt.executeQuery(); }</code></li></ul>This change dramatically reduces the load on the shared pool, prevents fragmentation, and improves overall database performance.</li>



<li><code>DBMS_SHARED_POOL.KEEP</code>: This Oracle package is used to &#8220;pin&#8221; frequently used and large PL/SQL packages, procedures, or SQL queries in the shared pool. A pinned object cannot be kicked out of the pool by the LRU algorithm and always stays in memory. This ensures that critical application code is always quickly accessible and prevents the fragmentation that could be caused by constantly reloading these objects. This is a targeted optimization technique, especially for large and complex PL/SQL-based applications. However, it should be used carefully, as pinning too much can leave insufficient space for new SQLs, leading to an   <code>ORA-04031</code> error.  </li>
</ul>
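<p>The same literal-versus-bind contrast can be shown in PL/SQL with native dynamic SQL. This is a minimal sketch; the <code>products</code> table mirrors the hypothetical table from the Java example above.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- BAD: each iteration builds a textually unique statement (a hard parse every time).
DECLARE
  l_cnt NUMBER;
BEGIN
  FOR i IN 1 .. 1000 LOOP
    EXECUTE IMMEDIATE
      'SELECT COUNT(*) FROM products WHERE product_id = ' || i
      INTO l_cnt;
  END LOOP;
END;
/

-- GOOD: one shared cursor, parsed once and reused via a bind variable.
DECLARE
  l_cnt NUMBER;
BEGIN
  FOR i IN 1 .. 1000 LOOP
    EXECUTE IMMEDIATE
      'SELECT COUNT(*) FROM products WHERE product_id = :p_id'
      INTO l_cnt USING i;
  END LOOP;
END;
/
</pre></div>



<p>Pinning with <code>DBMS_SHARED_POOL.KEEP</code> is a one-liner; the schema and package name below are hypothetical, and on some releases the package must first be installed by a DBA (via <code>dbmspool.sql</code>).</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Pin a critical package ('P' covers packages, procedures, and functions).
EXEC DBMS_SHARED_POOL.KEEP('APP_SCHEMA.PKG_ORDERS', 'P');
</pre></div>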



<h2 class="wp-block-heading">Chapter 6: Preventive Measures and Best Practices</h2>



<p>While solving the <code>ORA-04031</code> error is important, the ideal situation is to prevent this error from ever happening. This requires a shift from reactive responses to a proactive management philosophy. The following best practices provide a framework for maintaining the health of the database memory and minimizing the risk of <code>ORA-04031</code>.</p>



<h3 class="wp-block-heading">6.1. Proactive Memory Management and Regular Monitoring</h3>



<p>Instead of waiting for a problem to occur, it is crucial to regularly monitor the state of the database memory. This allows for the early detection of potential issues and intervention before they grow.</p>



<ul class="wp-block-list">
<li><strong>Regular Reporting:</strong> Use automated scripts to regularly collect data (e.g., daily or weekly) from views mentioned in Chapter 3, such as <code>V$SGASTAT</code>, <code>V$SHARED_POOL_RESERVED</code>, and <code>V$LIBRARYCACHE</code>. Tracking trends over time can show that memory usage is increasing or that fragmentation is slowly building up (a minimal example query follows this list).</li>



<li><strong>Using Advisor Views:</strong> Periodically check advisor views like <code>V$SHARED_POOL_ADVICE</code> to assess whether the current shared pool size is still optimal for the workload.  </li>



<li><strong>AWR/Statspack Reports:</strong> Regularly analyze Automatic Workload Repository (AWR) or Statspack reports. The &#8220;Library Cache Activity&#8221; and &#8220;Memory Allocation&#8221; sections in these reports contain valuable information about high hard parse rates or memory issues.</li>
</ul>
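<p>A minimal sketch of the kind of query such a monitoring script might run periodically; it relies only on <code>V$SGASTAT</code> and records the free memory per pool so trends can be charted over time.</p>



<div class="wp-block-syntaxhighlighter-code "><pre class="brush: sql; title: ; notranslate">
-- Free memory per pool; a steadily shrinking value combined with rising
-- hard parse counts is an early warning sign of ORA-04031.
SELECT TO_CHAR(SYSDATE, 'YYYY-MM-DD HH24:MI') AS sample_time,
       pool,
       bytes AS free_bytes
FROM   v$sgastat
WHERE  name = 'free memory';
</pre></div>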



<h3 class="wp-block-heading">6.2. Coding Standards: Creating a Culture of Avoiding Literal SQL</h3>



<p>Organizational processes are as important as technical solutions. Preventing the use of literal SQL, the most common cause of <code>ORA-04031</code> errors, requires collaboration between DBAs and development teams.</p>



<ul class="wp-block-list">
<li><strong>Developer Training:</strong> Provide regular training to development teams on what bind variables are, why they are important, and how to use them. Explain with concrete examples how the shared pool works and the negative impact of literal SQLs on the database.</li>



<li><strong>Code Review Processes:</strong> Include checks for database interactions in code review processes. Ensure that new code accessing the database uses bind variables.</li>



<li><strong>Static Code Analysis Tools:</strong> Use tools that automatically scan code that generates database queries to detect the use of literal SQL.</li>
</ul>



<p>This is a cultural change that prevents the problem at its source—in the application code—rather than trying to solve it at the database level.</p>



<h3 class="wp-block-heading">6.3. The Importance of Keeping the Database Version and Patches Up to Date</h3>



<p>In new database versions and regularly released patches (Release Updates &#8211; RU), Oracle continuously improves memory management algorithms and fixes known memory leaks or bugs.</p>



<ul class="wp-block-list">
<li><strong>Checking for Known Bugs:</strong> When an <code>ORA-04031</code> error is encountered, a standard step is to research known bugs for the used database version on My Oracle Support (MOS). The problem might be a known bug, and the solution could be as simple as applying a patch.  </li>



<li><strong>Regular Patching Strategy:</strong> Keeping databases up to date not only closes security vulnerabilities but also proactively protects against many known bugs that can cause such performance and stability issues. For example, it is known that Oracle 11g is more resilient to shared pool fragmentation than 10g.  </li>
</ul>



<p>These preventive measures make <code>ORA-04031</code> a rare event, allowing database administrators to spend their time on strategic improvements rather than firefighting.</p>



<h2 class="wp-block-heading">Chapter 7: Advanced Analysis: Diving Deep into Fragmentation with X$KSMSP</h2>



<p>While standard <code>V$</code> views are sufficient to diagnose most <code>ORA-04031</code> scenarios, sometimes it is necessary to go deeper into the problem and see the memory map of the shared pool in the finest detail. At this point, Oracle&#8217;s internal and undocumented <code>X$</code> tables come into play. <code>X$KSMSP</code> is one of the most powerful and dangerous of these tables.</p>



<h3 class="wp-block-heading">7.1. X$KSMSP: What It Is and Why It Should Be Used with Caution</h3>



<p><code>X$</code> tables are the raw data structures on which <code>V$</code> views are based. <code>X$KSMSP</code> (Kernel Services Memory Sga Pool) is a table that lists every single memory chunk in the shared pool heap. Each row contains the address, size, and status (<code>free</code>, <code>recreatable</code>, <code>freeable</code>, etc.) of a memory chunk. This allows the shared pool to be examined as if under a microscope.</p>



<p>However, this power comes at a price. Querying the <code>X$KSMSP</code> table requires intensive access to the latches that protect the shared pool itself. A query run against this table on a busy production system can cause serious contention on the shared pool latches. This can worsen the existing memory problem and negatively affect the overall performance of the system. Therefore, many experts warn that querying <code>X$KSMSP</code> on a live production system is a &#8220;very bad idea.&#8221;</p>



<p>This is a high-risk, high-reward diagnostic tool. Its use should be limited to the following situations:</p>



<ul class="wp-block-list">
<li>When all other diagnostic methods have been exhausted and the nature of the problem cannot be understood.</li>



<li>Ideally, in a test or development environment where the problem can be reproduced.</li>



<li>If it must be used in production, run it only when the system load is at its lowest or during a controlled maintenance window. Heeding these warnings is a sign of a mature and risk-aware database management approach.</li>
</ul>



<h3 class="wp-block-heading">7.2. Observing and Interpreting Memory Fragmentation Using X$KSMSP</h3>



<p>The power of <code>X$KSMSP</code> lies in visualizing the distribution of free memory in the shared pool. A query like the one below provides a clear picture of the fragmentation level by grouping free memory chunks by their size:</p>





<div class="wp-block-syntaxhighlighter-code "><pre class="brush: plain; title: ; notranslate">
SELECT
  ksmchcom AS chunk_comment,
  CASE
    WHEN ksmchsiz &lt; 1024 THEN '00: &lt; 1KB'
    WHEN ksmchsiz BETWEEN 1024 AND 4095 THEN '01: 1KB-4KB'
    WHEN ksmchsiz BETWEEN 4096 AND 8191 THEN '02: 4KB-8KB'
    ELSE '03: &gt; 8KB'
  END AS chunk_size_bucket,
  COUNT(*) AS chunk_count,
  SUM(ksmchsiz) AS total_bytes,
  TRUNC(AVG(ksmchsiz)) AS avg_bytes
FROM
  x$ksmsp
WHERE
  ksmchcls = 'free'
GROUP BY
  ksmchcom,
  CASE
    WHEN ksmchsiz &lt; 1024 THEN '00: &lt; 1KB'
    WHEN ksmchsiz BETWEEN 1024 AND 4095 THEN '01: 1KB-4KB'
    WHEN ksmchsiz BETWEEN 4096 AND 8191 THEN '02: 4KB-8KB'
    ELSE '03: &gt; 8KB'
  END
ORDER BY
  1, 2;

</pre></div>


<p>Interpreting the output of this query is key to diagnosing fragmentation:</p>



<ul class="wp-block-list">
<li><strong>A Healthy Pool:</strong> The output shows a significant number (<code>chunk_count</code>) and total size (<code>total_bytes</code>) of free memory chunks, especially in the large-sized groups (<code>> 8KB</code>). The number of chunks in the small-sized groups is lower.</li>



<li><strong>A Fragmented Pool:</strong> In the output, the total amount of free memory (<code>total_bytes</code>) may be high, but a large portion of this memory is concentrated in very small-sized groups (<code>&lt; 1KB</code>, <code>1KB-4KB</code>). The <code>chunk_count</code> is very high in small groups, while there are either none or very few in the large-sized groups (<code>> 8KB</code>).</li>
</ul>



<p>This output definitively proves that the cause of the <code>ORA-04031</code> error is not a lack of total free memory, but a lack of <strong>contiguous</strong> free memory. This confirms the findings from the <code>V$SHARED_POOL_RESERVED</code> view and reveals, without a doubt, that the root of the problem is memory fragmentation.</p>



<h2 class="wp-block-heading">Conclusion: Moving from Reactive Solutions to Proactive Management</h2>



<p>The <code>ORA-04031: unable to allocate bytes of shared memory</code> error is one of the most instructive problems in the Oracle database ecosystem. Behind a simple error message lie deep lessons about memory management, application design, and system monitoring philosophies. At the end of this comprehensive analysis, a few key conclusions emerge for dealing with and preventing this error.</p>



<p>The first and most important lesson is to internalize the <strong>&#8220;Victim and Culprit&#8221; paradigm</strong>. The fact that the process receiving the error is usually innocent, and the real culprit is a systemic behavior that consumes or fragments memory over time, channels diagnostic efforts in the right direction. This perspective takes DBAs out of the narrow scope of analyzing a single query and directs them to question the memory usage habits and application architecture of the entire system.</p>



<p>Secondly, it should be understood that solutions must be evaluated in a hierarchy. Emergency interventions like the <code>ALTER SYSTEM FLUSH SHARED_POOL</code> command or restarting the database are temporary band-aids that stop the bleeding but do not heal the wound. Configuration changes, such as increasing pool sizes or enabling Automatic Memory Management (AMM/ASMM), treat the symptoms and give the system more resilience. But the real and permanent &#8220;cure&#8221; is possible by getting to the root of the problem. For the vast majority of <code>ORA-04031</code> errors caused by fragmentation, this cure is to <strong>ensure the use of bind variables in the application code</strong>. This not only prevents <code>ORA-04031</code> but also reduces CPU usage, lowers &#8220;hard parse&#8221; rates, and increases overall database performance and scalability.</p>



<p>In conclusion, the <code>ORA-04031</code> error is rarely just a database configuration problem. It is often a mirror reflecting the quality and design of the applications running on it. To effectively combat this error, database administrators must move from a reactive &#8220;firefighting&#8221; mode to a proactive &#8220;system health management&#8221; mode. This transition includes practices such as regular monitoring, establishing coding standards in close collaboration with development teams, keeping the database up to date, and effectively using the modern diagnostic and management tools offered by Oracle. The ultimate goal is to turn <code>ORA-04031</code> from a moment of crisis into an opportunity to continuously improve the overall health and stability of the database ecosystem.</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/oracle-ora-4031-error-analysis-and-solutions.html">Oracle ORA-4031 Error: Analysis and Solutions</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Best Practices for Very Large Database (VLDB) Backup and Recovery:</title>
		<link>https://www.bugraparlayan.com.tr/best-practices-for-very-large-database-vldb-backup-and-recovery.html</link>
		
		<dc:creator><![CDATA[Bugra Parlayan]]></dc:creator>
		<pubDate>Sat, 14 Jun 2025 17:44:00 +0000</pubDate>
				<category><![CDATA[Engineered Systems]]></category>
		<category><![CDATA[archive logs]]></category>
		<category><![CDATA[backup retention]]></category>
		<category><![CDATA[differential backup]]></category>
		<category><![CDATA[full backup]]></category>
		<category><![CDATA[incremental backup]]></category>
		<category><![CDATA[point-in-time recovery]]></category>
		<category><![CDATA[recovery point objective]]></category>
		<category><![CDATA[recovery time objective]]></category>
		<category><![CDATA[restore strategy]]></category>
		<guid isPermaLink="false">https://www.bugraparlayan.com.tr/?p=1480</guid>

					<description><![CDATA[<p>1. Executive Summary Backing up and recovering Very Large Databases (VLDBs) presents a critical yet increasingly complex challenge for organizations in today&#8217;s data-driven world. With data volumes growing exponentially, traditional backup methods often fall short in meeting performance targets, Recovery Time Objectives (RTOs), and Recovery Point Objectives (RPOs). This report examines best practices in VLDB &#8230;</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/best-practices-for-very-large-database-vldb-backup-and-recovery.html">Best Practices for Very Large Database (VLDB) Backup and Recovery:</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">1. Executive Summary</h2>



<p>Backing up and recovering Very Large Databases (VLDBs) presents a critical yet increasingly complex challenge for organizations in today&#8217;s data-driven world. With data volumes growing exponentially, traditional backup methods often fall short in meeting performance targets, Recovery Time Objectives (RTOs), and Recovery Point Objectives (RPOs). This report examines best practices in VLDB backup and recovery, drawing on insights presented in Oracle MAA (Maximum Availability Architecture) blog posts, with a specific focus on Oracle&#8217;s Zero Data Loss Recovery Appliance (ZDLRA) solution.</p>



<p>The ZDLRA is a purpose-built engineered system designed to address these challenges. Its core strategies include &#8220;Incremental Forever&#8221; backups, which significantly reduce the load on production systems; real-time redo protection for near-zero data loss; and continuous recovery validation, enhancing the reliability of backups. These features are tailored to meet the unique demands of VLDBs, offering substantial improvements in achieving RTO and RPO targets. Oracle&#8217;s development and promotion of a specialized hardware/software appliance like ZDLRA suggest that traditional, software-only backup methods are increasingly inadequate for the scale and criticality of modern VLDBs. This implies the problem&#8217;s complexity has reached a point where integrated hardware and software solutions offer a more effective approach than generic software tools running on general-purpose hardware. This is a significant paradigm shift in high-end backup and recovery strategies. Consequently, organizations managing VLDBs must assess whether their current backup infrastructure can realistically meet future demands or if a specialized appliance approach is becoming a necessity.</p>



<h2 class="wp-block-heading">2. Introduction to VLDB Backup and Recovery Challenges</h2>



<p>Very Large Databases (VLDBs) typically contain terabytes to petabytes of data and are continuously growing. The sheer size and complexity of these databases introduce unique challenges in backup and recovery processes.</p>



<ul class="wp-block-list">
<li><strong>Defining VLDBs and Their Criticality:</strong> VLDBs are central repositories for businesses&#8217; core operations, customer data, financial records, and other vital information. Therefore, any data loss or prolonged downtime in these systems can lead to severe financial losses, reputational damage, and legal repercussions. Business continuity and regulatory compliance are primary drivers for robust backup and recovery strategies for VLDBs.</li>



<li><strong>Common Pain Points:</strong>
<ul class="wp-block-list">
<li><strong>Backup Windows:</strong> Completing full backups of VLDBs within limited timeframes without an acceptable impact on production performance is extremely difficult. As database size increases, full backup times lengthen, often encroaching on business hours and negatively affecting system performance.</li>



<li><strong>Recovery Time Objectives (RTO):</strong> Restoring and recovering massive databases quickly enough to meet business needs in the event of a disaster or failure is a major hurdle. Long recovery times lead to extended business disruptions and, consequently, increased costs.</li>



<li><strong>Recovery Point Objectives (RPO):</strong> There is always a risk of significant data loss due to the time gap between backups. Even hourly or more frequent backups can lead to unacceptable data loss in high-transaction-volume systems.</li>



<li><strong>Performance Impact:</strong> Backup operations generate significant I/O (Input/Output) and CPU (Central Processing Unit) load on production database servers. This load can degrade application performance, especially during peak hours.</li>



<li><strong>Storage Costs:</strong> Managing and storing large volumes of backup data incurs substantial storage costs. Long-term retention policies and multiple backup copies further escalate these costs.</li>



<li><strong>Complexity:</strong> Managing complex backup scripts, schedules, and recovery procedures creates a significant operational burden and increases the risk of human error.</li>
</ul>
</li>
</ul>



<p>These challenges are not just technical but also economic and operational. The &#8220;pain points&#8221; are interconnected; for example, trying to shrink backup windows with traditional methods can increase the performance impact, and aggressive RPO targets can lead to higher storage costs. Because VLDBs are large, backups are inherently time-consuming. Businesses demand short RTOs and near-zero RPOs. Attempting frequent full backups on VLDBs (for RPO) exacerbates backup window and performance impact issues. Using traditional incremental backups can lead to complex and lengthy recovery processes, jeopardizing RTO. This creates a cycle of trade-offs where optimizing one aspect negatively affects another. This highlights the need for a holistic solution that addresses these interconnected challenges simultaneously, rather than in isolation, which is the rationale for an integrated system like ZDLRA.</p>



<h2 class="wp-block-heading">3. Oracle&#8217;s Zero Data Loss Recovery Appliance (ZDLRA): A Purpose-Built Solution for VLDBs (Based on Oracle MAA Blog Insights)</h2>



<p>Oracle&#8217;s Zero Data Loss Recovery Appliance (ZDLRA) is a purpose-built engineered system developed to address the challenges encountered in backing up and recovering Very Large Databases (VLDBs). This section will examine the core features of ZDLRA and its significance in VLDB protection, based on points highlighted in Oracle MAA blogs.</p>



<h3 class="wp-block-heading">3.1. Overview of the ZDLRA Approach</h3>



<p>The ZDLRA is a purpose-built engineered system developed by Oracle to centralize and optimize database backup and recovery operations, focusing on protection, efficiency, and scalability for Oracle databases. It&#8217;s crucial to emphasize that ZDLRA is not merely a software solution but a comprehensive one where hardware and software are co-engineered for optimal performance and reliability in the demanding context of VLDB protection. As the Oracle MAA blog post states, ZDLRA is &#8220;a purpose-built engineered system designed to maximize hardware and software to provide a highly available, zero data loss environment&#8221;; software alone cannot achieve this, which is why ZDLRA&#8217;s integrated hardware and software approach is critical for meeting stringent RTO and RPO requirements.</p>



<h3 class="wp-block-heading">3.2. The &#8220;Incremental Forever&#8221; Strategy for VLDBs</h3>



<p>One of ZDLRA&#8217;s most notable features is its &#8220;Incremental Forever&#8221; or &#8220;virtual full&#8221; backup strategy. This strategy fundamentally changes the backup process for VLDBs.</p>



<ul class="wp-block-list">
<li><strong>Mechanism:</strong> After an initial full (Level 0) backup, only changed data blocks are sent from the production database to the ZDLRA. The ZDLRA then synthesizes full backups (&#8220;virtual fulls&#8221;) from these incremental backups. This eliminates the overhead of taking full backups every day.</li>



<li><strong>Benefits for VLDBs:</strong>
<ul class="wp-block-list">
<li><strong>Reduced Production Impact:</strong> &#8220;This strategy reduces the processing load on production systems by only transmitting changed data during daily incremental backups.&#8221;  This minimizes I/O and CPU load on the source database, which is critical for performance-sensitive VLDBs. Traditional Level 0 + Level 1 backups are problematic for VLDBs: Level 0s are too large and impact performance; recovery from many Level 1s is slow. ZDLRA&#8217;s &#8220;Incremental Forever&#8221; sends only changed blocks <em>after the initial full backup</em>. This dramatically reduces the daily backup workload on the production database.  </li>



<li><strong>Storage Efficiency:</strong> Efficient storage of incremental data on the ZDLRA and pointer-based virtual fulls &#8220;can lead to a 10X decrease in space consumption.&#8221;  This offers a significant cost advantage, especially when dealing with large data volumes.  </li>



<li><strong>Faster Backup Completion:</strong> Daily &#8220;backups&#8221; are essentially small incremental backups, significantly shortening the backup window.</li>



<li><strong>Efficient Recovery:</strong> It &#8220;allows for more efficient recovery compared to traditional RMAN incremental-based recovery.&#8221;  Restoring a virtual full backup is similar to restoring a true full backup, without the need to sequentially apply numerous incrementals. ZDLRA takes on the task of creating &#8220;virtual full&#8221; backups from these incrementals. This means the <em>appliance</em>, not the production server, does the heavy lifting. For recovery, the database can be restored from a virtual full, which is much faster than applying a long chain of traditional incrementals. This directly improves RTO.  </li>



<li><strong>Offloading of Backup Operations:</strong> &#8220;By offloading backup compression, deletion, validation, and maintenance operations to the appliance, production systems can focus on workloads.&#8221;  This further frees up production server resources.  </li>
</ul>
</li>
</ul>
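<p>To make the cycle concrete, below is a minimal RMAN sketch of the incremental-forever pattern. It assumes an SBT channel to the Recovery Appliance has already been configured; the commands and schedule are illustrative, not taken from the referenced MAA posts.</p>



<pre class="wp-block-code"><code># One-time seed: the only physical full (level 0) backup ever taken
BACKUP INCREMENTAL LEVEL 0 DATABASE;

# Recurring daily job: only changed blocks leave the production host;
# ZDLRA synthesizes a new virtual full from each run
BACKUP CUMULATIVE INCREMENTAL LEVEL 1 DATABASE;</code></pre>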



<p>This approach fundamentally changes the backup paradigm for Oracle databases. By shifting intelligence and workload from the source database to a specialized appliance, it allows production systems to dedicate their resources to business operations. It also simplifies recovery processes, reducing the potential for human error.</p>



<h3 class="wp-block-heading">3.3. Achieving Real-Time, Near-Zero Data Loss Protection (Near-Zero RPO)</h3>



<p>ZDLRA offers an innovative approach to minimizing the Recovery Point Objective (RPO).</p>



<ul class="wp-block-list">
<li><strong>Mechanism:</strong> &#8220;The Recovery Appliance uses Oracle&#8217;s real-time redo transport to deliver continuous, real-time data protection. Transactional changes (redo) are transmitted directly to the appliance, where archived redo log backups are created and stored.&#8221; This mechanism is similar to Data Guard redo transport but specifically designed for backup and recovery assurance (see the configuration sketch after this list).</li>



<li><strong>Benefits:</strong> &#8220;This provides immediate, zero data loss protection of all changes, and directly addresses the RPO objective of minimizing data loss.&#8221;  This means recovery can be performed up to the last committed transaction received by the ZDLRA, achieving an RPO of seconds or sub-seconds rather than hours. Traditional RPO is often tied to the frequency of archived log backups or discrete incremental data backups. For VLDBs, there can still be gaps between these discrete operations (e.g., every 15 mins, 1 hour). Real-time redo transport sends redo data <em>as it&#8217;s generated</em> (or very close to it) to the ZDLRA. The ZDLRA then archives this redo. This means that even if a failure occurs between discrete incremental data backups, the redo logs up to (or very near) the point of failure are already secured on the ZDLRA. This dramatically improves RPO beyond what traditional scheduled backups can offer, allowing for recovery with minimal or no data loss.  </li>
</ul>
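<p>On the protected database side, enabling real-time redo transport looks much like adding a Data Guard redo destination. The following is a minimal sketch; the destination number, service name, and <code>DB_UNIQUE_NAME</code> are illustrative assumptions, not values from the referenced posts.</p>



<pre class="wp-block-code"><code>-- Point a redo destination at the Recovery Appliance (names illustrative);
-- LOG_ARCHIVE_CONFIG may also need the appliance's DB_UNIQUE_NAME added
ALTER SYSTEM SET LOG_ARCHIVE_DEST_3='SERVICE=zdlra_ingest ASYNC NOAFFIRM VALID_FOR=(ONLINE_LOGFILE,ALL_ROLES) DB_UNIQUE_NAME=zdlra' SCOPE=BOTH SID='*';
ALTER SYSTEM SET LOG_ARCHIVE_DEST_STATE_3=ENABLE SCOPE=BOTH SID='*';</code></pre>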



<p>This feature is a game-changer for businesses with extremely low tolerance for data loss. It elevates ZDLRA from merely a backup device to a key component of a high-availability and data protection strategy, approaching disaster recovery capabilities for recent transactions. It also implies a tighter integration with the database&#8217;s transaction processing cycle.</p>



<h3 class="wp-block-heading">3.4. Ensuring Data Integrity with Continuous Recovery Validation</h3>



<p>The reliability of backups is paramount for successful recovery. ZDLRA takes a proactive approach to this.</p>



<ul class="wp-block-list">
<li><strong>Process:</strong> &#8220;The appliance performs corruption detection throughout the backup cycle to validate data consistency and immediately alerts administrators if corruption is detected. It checks all incoming and replicated backups for block-level validity. Any corrupted data is detected, recorded, and alerted, allowing administrators to take action.&#8221;   </li>



<li><strong>Benefits:</strong> &#8220;This assurance of valid data is a key component for a successful recovery, directly impacting RTO by ensuring that restored data is usable and not corrupt.&#8221;  This proactive validation prevents the discovery of corruption only at the critical moment of recovery, which could severely impact RTO and business operations. Data block corruption can occur on the primary database and, if undetected, propagate to backups. Traditional validation might happen during the backup process (e.g., RMAN <code>VALIDATE DATABASE</code>) or as a separate scheduled task, but ZDLRA makes it an intrinsic part of data ingestion. By checking blocks as they arrive and as virtual fulls are created/maintained, ZDLRA provides an early warning system. If corruption is detected in an incoming backup, administrators are alerted immediately. This allows them to address the issue on the primary database or ensure subsequent backups are clean, rather than discovering the problem months later during a critical restore. This ensures that backups stored on ZDLRA are known to be good, which is fundamental for a predictable and successful RTO.  </li>
</ul>



<p>This feature increases confidence in the backup repository. It means that when a recovery is initiated, there is a much higher certainty that the restored data will be valid and uncorrupted, reducing the risk of failed recoveries or recoveries that bring back corrupt data, which can be worse than no recovery at all. This also reduces the need for extensive manual validation efforts.</p>
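<p>For a sense of what this automation replaces, the closest manual equivalents in a plain RMAN environment are explicit validation runs such as the ones below; ZDLRA effectively performs this kind of checking continuously on the appliance, without consuming production resources.</p>



<pre class="wp-block-code"><code># Scan datafiles for block corruption (add CHECK LOGICAL for logical checks)
VALIDATE DATABASE;

# Read existing backups end to end and confirm they are restorable,
# without writing any files
RESTORE DATABASE VALIDATE;</code></pre>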



<h3 class="wp-block-heading">3.5. The Significance of an Engineered System Approach for VLDBs</h3>



<p>The idea that ZDLRA is not just software but an integrated hardware and software solution is fundamental to its effectiveness in VLDB protection. The article emphasizes that ZDLRA is &#8220;a purpose-built engineered system designed to maximize hardware and software&#8221; and that software alone cannot achieve this. This co-engineering allows for optimizations in I/O, network traffic, storage management, and processing that would be difficult to achieve with general-purpose components. Protecting VLDBs efficiently requires high throughput for backups, fast access for restores, and robust processing for tasks like validation and virtual full creation. General-purpose hardware and backup software might not be optimally configured to work together for these specific, demanding Oracle database workloads. An engineered system allows the vendor (Oracle) to control and optimize all layers: the database-side agents, the network protocols used, the internal processing within the appliance, and the storage layout. This tight integration can lead to performance, reliability, and manageability benefits that are hard to replicate with a piecemeal approach.</p>



<p>The &#8220;engineered system&#8221; argument positions ZDLRA as a premium, high-performance solution where the whole is greater than the sum of its parts. It implies that Oracle has fine-tuned every component of the stack, from database interaction to storage within the appliance, for the specific task of Oracle database protection. While potentially carrying a higher upfront cost, the engineered system approach aims to deliver a lower TCO (Total Cost of Ownership) through operational efficiencies, reduced risk, and superior performance. It also signifies a single-vendor commitment to supporting the entire solution stack, potentially simplifying troubleshooting and support. This is a strategic choice for organizations where VLDB protection is a top-tier priority.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Challenge Area</th><th>Traditional Approach Pain Points</th><th>ZDLRA Solution &amp; Key Features Leveraged</th></tr><tr><td>Backup Window</td><td>Long full backups, performance impact</td><td>Incremental Forever, offloaded processing</td></tr><tr><td>RTO</td><td>Slow recovery from many incrementals, risk of corrupt backup</td><td>Virtual Full Backups, Continuous Recovery Validation</td></tr><tr><td>RPO</td><td>Data loss since last backup (hours)</td><td>Real-Time Redo Transport</td></tr><tr><td>Production Impact</td><td>High CPU/IO during backups</td><td>Incremental Forever (sends only changes), offloaded processing (compression, validation) <sup></sup></td></tr><tr><td>Storage Consumption</td><td>Multiple full backups, large incrementals</td><td>Incremental Forever (stores deltas efficiently), space-efficient virtual fulls <sup></sup></td></tr><tr><td>Backup Integrity</td><td>Corruption detected late (at restore or via periodic checks)</td><td>Continuous Recovery Validation (proactive, during backup cycle <sup></sup>)</td></tr><tr><td>Management Complexity</td><td>Complex scripting, scheduling, manual validation</td><td>Centralized appliance management, automated validation and virtual full creation</td></tr></tbody></table></figure>



<p>This table visually reinforces how ZDLRA directly addresses specific, long-standing pain points in VLDB management, making it easier to quickly grasp the benefits that would justify evaluating such a system.</p>



<h2 class="wp-block-heading">4. Key Considerations and Best Practices in ZDLRA-Centric VLDB Backup and Recovery Implementations</h2>



<p>While ZDLRA offers powerful capabilities that significantly improve VLDB backup and recovery processes, fully leveraging these capabilities requires careful planning, configuration, and adherence to operational best practices. This section will translate ZDLRA&#8217;s features into actionable considerations and best practices. Although the referenced Oracle blog posts do not offer <em>additional</em> general best practices beyond ZDLRA itself, this section focuses on <em>how best to leverage</em> ZDLRA&#8217;s capabilities and what to pay attention to <em>within the ZDLRA context</em>.</p>



<h3 class="wp-block-heading">4.1. Optimizing Recovery Time Objective (RTO) with ZDLRA</h3>



<ul class="wp-block-list">
<li>Leverage ZDLRA&#8217;s virtual full backups for rapid restores. This significantly shortens recovery times.</li>



<li>Ensure ZDLRA sizing is adequate to meet restore performance demands. Insufficient resources can lead to missed RTO targets.</li>



<li>Regularly test recovery scenarios to validate RTOs. ZDLRA&#8217;s &#8220;Continuous Recovery Validation&#8221; ensures backups are valid, a prerequisite for meeting RTOs, but real-world tests confirm the entire process works as expected (see the preview sketch after this list).</li>
</ul>
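<p>A lightweight starting point for such tests, assuming a standard RMAN connection to the Recovery Appliance catalog, is to preview exactly which backups a full restore would draw on; full drills to a sandbox host should complement this.</p>



<pre class="wp-block-code"><code># List the backups (virtual fulls, archived logs) a restore would use,
# without restoring anything
RESTORE DATABASE PREVIEW SUMMARY;</code></pre>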



<h3 class="wp-block-heading">4.2. Minimizing Recovery Point Objective (RPO) with ZDLRA</h3>



<ul class="wp-block-list">
<li>Implement and monitor real-time redo transport diligently. This is the primary mechanism for achieving near-zero RPO (a monitoring query follows this list).</li>



<li>Understand and meet network requirements to ensure real-time redo transport does not lag. Insufficient network bandwidth or high latency can compromise RPO targets.</li>
</ul>
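<p>On the protected database, the health of the redo destination can be watched with standard views. A minimal check, assuming the destination number used in the earlier configuration sketch:</p>



<pre class="wp-block-code"><code>-- Status, gap state, and last error for the ZDLRA redo destination
SELECT dest_id, status, gap_status, error
FROM   v$archive_dest_status
WHERE  dest_id = 3;</code></pre>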



<h3 class="wp-block-heading">4.3. Managing Production System Performance</h3>



<ul class="wp-block-list">
<li>While ZDLRA&#8217;s &#8220;Incremental Forever&#8221; strategy  significantly reduces production impact, confirm this by monitoring baseline database performance metrics post-implementation.  </li>



<li>Optimize network bandwidth between production databases and the ZDLRA. This is critical for the efficiency of both incremental backups and real-time redo transport.</li>
</ul>



<h3 class="wp-block-heading">4.4. Ensuring Backup Data Integrity and Reliability</h3>



<ul class="wp-block-list">
<li>Rely on ZDLRA&#8217;s &#8220;Continuous Recovery Validation&#8221;, but also understand its alerting mechanisms and integrate them into operational monitoring systems. Early warnings allow for proactive resolution of potential issues.</li>



<li>Consider ZDLRA replication to a secondary ZDLRA for disaster recovery of the backup data itself. This ensures backups are protected even if the primary ZDLRA fails.</li>
</ul>



<h3 class="wp-block-heading">4.5. Storage Management and Efficiency within ZDLRA</h3>



<ul class="wp-block-list">
<li>Understand ZDLRA&#8217;s internal storage management, space reclamation, and how the &#8220;up to 10X decrease in space consumption&#8221;  is achieved and monitored.  </li>



<li>Plan retention policies carefully to balance recovery needs with storage capacity. Overly long retention periods can lead to unnecessary costs, while too short retention can limit recovery capabilities.</li>
</ul>



<h3 class="wp-block-heading">4.6. Network Configuration and Sizing</h3>



<ul class="wp-block-list">
<li>Emphasize the importance of dedicated, high-bandwidth, low-latency network connectivity between production databases and the ZDLRA, especially for real-time redo transport and large data transfers. The network should not be a bottleneck for backup and recovery performance.</li>
</ul>



<h3 class="wp-block-heading">4.7. Regular Testing and Validation of Recovery Procedures</h3>



<ul class="wp-block-list">
<li>Even with ZDLRA&#8217;s automation and validation, conduct periodic, full recovery drills to test the end-to-end process, human procedures, and infrastructure. This validates the entire recovery plan, not just the technology.</li>
</ul>



<p>Implementing ZDLRA is not a &#8220;set it and forget it&#8221; solution. While it automates and optimizes many aspects, careful planning, configuration, ongoing monitoring, and testing are still critical to realizing its full benefits. The &#8220;best practices&#8221; shift from managing the intricacies of RMAN scripts to managing the ZDLRA ecosystem. ZDLRA offers advanced features like &#8220;Incremental Forever,&#8221; &#8220;Real-Time Redo,&#8221; and &#8220;Continuous Validation.&#8221; These features have prerequisites and operational aspects (e.g., network for redo, monitoring alerts for validation, capacity planning for storage). Simply deploying the appliance does not guarantee optimal RTO/RPO or reliability. Administrators must understand how these features work, configure them correctly, monitor their performance, and integrate ZDLRA into broader DR and operational procedures. Regular testing is essential to confirm the entire system (database, network, ZDLRA, recovery procedures) performs as expected under pressure. The role of the Database Administrator (DBA) also evolves in this context. They may spend less time on low-level backup scripting and more on strategic data protection management for ZDLRA, capacity planning, and ensuring end-to-end recoverability of business services. Expertise specific to ZDLRA itself becomes important.</p>



<h2 class="wp-block-heading">5. Conclusion and Recommendations</h2>



<p>As presented in the Oracle MAA blog posts, the Zero Data Loss Recovery Appliance (ZDLRA) offers significant advantages in the realm of Very Large Database (VLDB) backup and recovery. These benefits include near-zero data loss through a vastly improved Recovery Point Objective (RPO), reliable Recovery Time Objective (RTO) via virtual full backups and continuous validation, reduced impact on production systems, and enhanced data integrity.</p>



<p>As an engineered system, ZDLRA represents a strategic approach to tackling the complexities of VLDB protection. The co-engineering of hardware and software allows for performance and reliability optimizations that are difficult to achieve with general-purpose solutions. This is a critical differentiator, especially in today&#8217;s environment where data volume and transaction rates challenge traditional methods.</p>



<p>However, it must be emphasized that while ZDLRA offers powerful capabilities, successful implementation requires careful planning, a full understanding of its features, and adherence to operational best practices, particularly concerning network configuration, monitoring, and regular recovery testing. Adopting ZDLRA is not merely a technical decision but signifies a commitment to a high level of data protection and availability, driven by the critical nature of the VLDBs it protects. This is an investment that should align with the value of the data and the cost of downtime/data loss.</p>



<p>It is important to note that this report focuses on ZDLRA-centric best practices highlighted in the provided Oracle blog post summaries. A comprehensive discussion of all VLDB backup and recovery techniques, including non-ZDLRA alternatives or complementary strategies like storage snapshots or Oracle Data Guard for DR beyond backup, would require additional resources beyond the scope of the provided material.</p>



<p>In conclusion, organizations should evaluate ZDLRA as part of their overall IT strategy, considering its integration with other systems, the skills required to manage it, and its alignment with long-term data growth and protection needs. When implemented and managed correctly, ZDLRA can provide unparalleled protection and recovery assurance for VLDBs, helping businesses secure one of their most valuable assets: their data.</p>



<p>Ref:</p>



<p><a href="https://blogs.oracle.com/maa/post/very-large-database-backup-and-recovery-best-practices" target="_blank" rel="noopener">https://blogs.oracle.com/maa/post/very-large-database-backup-and-recovery-best-practices</a></p>



<p><a href="https://blogs.oracle.com/maa/post/very-large-database-backup-and-recovery-best-practices-part-2" target="_blank" rel="noopener">https://blogs.oracle.com/maa/post/very-large-database-backup-and-recovery-best-practices-part-2</a></p>



<p></p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/best-practices-for-very-large-database-vldb-backup-and-recovery.html">Best Practices for Very Large Database (VLDB) Backup and Recovery:</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Analysis of Delta Push and Delta Store Mechanisms within ZDLRA</title>
		<link>https://www.bugraparlayan.com.tr/analysis-of-delta-push-and-delta-store-mechanisms-within-zdlra.html</link>
		
		<dc:creator><![CDATA[Bugra Parlayan]]></dc:creator>
		<pubDate>Wed, 04 Jun 2025 19:35:29 +0000</pubDate>
				<category><![CDATA[Engineered Systems]]></category>
		<category><![CDATA[Backup Optimization]]></category>
		<category><![CDATA[Block Change Tracking]]></category>
		<category><![CDATA[Change Tracking]]></category>
		<category><![CDATA[Delta Compression]]></category>
		<category><![CDATA[Delta Push]]></category>
		<category><![CDATA[Delta Store]]></category>
		<category><![CDATA[Incremental Forever]]></category>
		<category><![CDATA[Oracle ZDLRA]]></category>
		<category><![CDATA[Recovery Appliance]]></category>
		<category><![CDATA[RMAN Delta push]]></category>
		<guid isPermaLink="false">https://www.bugraparlayan.com.tr/?p=1475</guid>

					<description><![CDATA[<p>I. Introduction to Oracle Zero Data Loss Recovery Appliance (ZDLRA) Technology The Oracle Zero Data Loss Recovery Appliance (ZDLRA or Recovery Appliance), an engineered system specifically designed for Oracle Databases, was developed to eliminate data loss and significantly reduce the data protection workload on production database servers. Its primary goal is to protect transactions in &#8230;</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/analysis-of-delta-push-and-delta-store-mechanisms-within-zdlra.html">Analysis of Delta Push and Delta Store Mechanisms within ZDLRA</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">I. Introduction to Oracle Zero Data Loss Recovery Appliance (ZDLRA) Technology</h2>



<p>The Oracle Zero Data Loss Recovery Appliance (ZDLRA or Recovery Appliance), an engineered system specifically designed for Oracle Databases, was developed to eliminate data loss and significantly reduce the data protection workload on production database servers. Its primary goal is to protect transactions in real-time, enabling databases to be recovered to within less than a second in the event of an outage or ransomware attack. This approach fundamentally differs from traditional backup solutions, which often lead to data loss measured in hours or even a day. ZDLRA works in tight integration with Oracle Database and Recovery Manager (RMAN) and offers capabilities not possible with general-purpose backup solutions. It is built upon Exadata hardware, from which it inherits performance and scalability features. Positioned for modern cybersecurity protection, ZDLRA offers features like backup immutability and continuous validation.</p>



<p>The data protection philosophy underlying this system represents a paradigm shift from reactive backup to proactive, continuous data protection. Traditional backup operations are performed periodically (e.g., nightly), which inherently carries the potential for data loss since the last backup. ZDLRA, on the other hand, captures changes as they occur through mechanisms like &#8220;real-time protection&#8221; and &#8220;real-time redo transport&#8221;. This continuous capture reduces the Recovery Point Objective (RPO) to sub-second levels, significantly mitigating the business risk associated with data loss and moving beyond mere backup to provide near-continuous data assurance.</p>



<p>The fact that ZDLRA is an &#8220;engineered system&#8221; (built on Exadata) is critical to its performance and reliability. General-purpose backup solutions run on general-purpose hardware and software, which may not be optimized for the unique I/O patterns and metadata intensity of Oracle Database backups. As an engineered system like Exadata, ZDLRA&#8217;s hardware (storage, nodes, InfiniBand) and software are co-engineered and pre-tuned for Oracle Database workloads. This co-engineering enables high throughput (up to 60 TB per hour per rack), efficient handling of Oracle-specific block formats, and the scalability required for enterprise-wide database protection. Thus, ZDLRA&#8217;s effectiveness stems not just from its software features but from its holistic system design, purpose-built for Oracle databases.</p>



<h3 class="wp-block-heading">A. &#8220;Incremental Forever&#8221; Backup Strategy: A Paradigm Shift</h3>



<p>ZDLRA implements an &#8220;incremental-forever&#8221; backup architecture. After an initial one-time full (level 0) backup, only incremental (level 1) backups are sent from the protected databases to the Recovery Appliance. This strategy eliminates the need for recurring full backups, which are resource-intensive on production systems and can impact application performance. The &#8220;incremental forever&#8221; approach significantly reduces backup windows, database server load (CPU, memory, I/O), and network traffic.</p>



<p>The &#8220;incremental forever&#8221; strategy is more than just sending fewer full backups; it is enabled by ZDLRA&#8217;s sophisticated backend processing (Delta Store) that synthesizes these incremental backups into readily usable &#8220;virtual full backups.&#8221; Simply sending only incremental backups without a mechanism to consolidate them would make restores complex and slow, requiring the sequential application of many incrementals. ZDLRA overcomes this by using the Delta Store to process incoming incremental backups and create &#8220;virtual full backups&#8221;. These virtual full backups are representations of a full backup at a specific point in time, constructed from the initial level 0 and subsequent level 1 block changes. This means that for recovery, RMAN can restore a single virtual full backup without the burden of applying numerous incremental backups on the client side, making the &#8220;incremental forever&#8221; strategy practical and highly efficient for recovery.</p>



<p>The following table summarizes the key concepts of ZDLRA and the benefits it offers:</p>



<p><strong>Table 1: Key ZDLRA Concepts and Benefits</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Concept</th><th>Brief Definition in ZDLRA Context</th><th>Key Benefit to the Organization</th></tr><tr><td>Zero Data Loss</td><td>Goal of reducing data loss to virtually zero by protecting database transactions in real-time.</td><td>Minimizes critical data loss risk, enhances business continuity.</td></tr><tr><td>Incremental Forever</td><td>Only incremental backups are taken after the initial full backup, eliminating the need for periodic full backups.</td><td>Shortens backup windows, reduces load on production systems, saves storage.</td></tr><tr><td>Real-Time Redo Transport</td><td>Instantaneous transfer of database redo log changes to ZDLRA.</td><td>Provides sub-second RPO (Recovery Point Objective), minimizing data loss.</td></tr><tr><td>Virtual Full Backup</td><td>A logical backup synthesized on ZDLRA from incremental backups, behaving like a full backup.</td><td>Enables fast restores, uses storage space efficiently.</td></tr><tr><td>Sub-Second RPO</td><td>Reduction of data loss tolerance to below one second.</td><td>Minimizes data loss impact for business-critical applications.</td></tr><tr><td>Continuous Validation</td><td>Backup integrity and recoverability are continuously checked by ZDLRA.</td><td>Ensures reliable restores, reduces risk of corrupt backups.</td></tr></tbody></table></figure>



<h2 class="wp-block-heading">II. Delta Push: The Continuous Data Ingestion Engine</h2>



<h3 class="wp-block-heading">A. Delta Push Concept and Objectives</h3>



<p>&#8220;Delta Push&#8221; is a term Oracle uses to describe the process by which protected databases send only the minimum necessary data (i.e., the &#8220;delta difference&#8221; or changes) to ZDLRA for protection. This process encompasses both incremental backups of data blocks and real-time transport of redo log changes. The primary objective is to minimize the impact on production systems by transmitting only unique changed data, thereby reducing CPU, I/O, and network load. It is a source-side optimization enabled by RMAN block change tracking and tight integration with the Oracle database.</p>



<p>&#8220;Delta Push&#8221; is more than just an incremental backup; it is a holistic strategy to capture <em>all</em> relevant database changes (data blocks and redo) with minimal production impact, forming the ingestion mechanism for ZDLRA&#8217;s continuous data protection. Traditional incremental backups capture changed data blocks at set intervals. Real-time redo transport continuously captures transaction log changes, even between incremental backups. One Oracle source states explicitly that &#8220;changes in the database are sent to ZDLRA using the Delta Push process,&#8221; while another clarifies that &#8220;Oracle calls Virtual Backups + Real-Time Redo as Delta Push.&#8221; Therefore, Delta Push is not a single technology but a combination of RMAN incremental backups and real-time redo transport working in concert to ensure comprehensive and timely capture of all database modifications. This dual approach is key to achieving both efficient backups and near-zero RPO.</p>



<h3 class="wp-block-heading">B. Operational Mechanisms</h3>



<h4 class="wp-block-heading">1. Leveraging RMAN Incremental Backups</h4>



<p>ZDLRA utilizes the RMAN &#8220;incremental backup&#8221; API to capture changes from the source database. After the initial level 0 backup, all subsequent backups are level 1 incremental backups. These can be cumulative backups that use the latest virtual level 0 as their baseline. The ZDLRA Backup Module (an SBT library) facilitates the transfer of these incremental backups from the protected database to the Recovery Appliance. RMAN block change tracking on the source database efficiently identifies changed blocks, so only those blocks are read and sent.</p>
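<p>In practice, the backup module is wired in through an ordinary RMAN SBT channel configuration on the protected database. The sketch below is illustrative: the library path, wallet location, and connect string are placeholders to adapt to your environment.</p>



<pre class="wp-block-code"><code># Route RMAN backups through the ZDLRA backup module (paths illustrative)
CONFIGURE CHANNEL DEVICE TYPE 'SBT_TAPE' PARMS "SBT_LIBRARY=/u01/app/oracle/lib/libra.so, ENV=(RA_WALLET='location=file:/u01/app/oracle/wallet credential_alias=zdlra-scan:1521/zdlra:dedicated')";
CONFIGURE DEFAULT DEVICE TYPE TO SBT_TAPE;</code></pre>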



<p>ZDLRA fundamentally transforms the <em>purpose</em> and <em>outcome</em> of an RMAN incremental backup. While the RMAN command on the client-side might look similar, ZDLRA processes it not just as a standalone incremental backup, but as a component of a virtual full backup. Traditionally, an RMAN incremental backup is a set of blocks changed since the last backup of a certain level. With ZDLRA, however, &#8220;when a DELTA PUSH is executed, the results are automatically transformed into a VIRTUAL FULL backup in what is known as the Delta Store inside of ZDLRA&#8221;: the appliance opens the RMAN block, reads the data file blocks within it, and creates a new virtual level 0 using the backup already existing on ZDLRA. This indicates that ZDLRA doesn&#8217;t just store the incremental backup set; it actively parses it and integrates its constituent changed blocks into its versioned block store (Delta Store) to synthesize a new point-in-time full representation. This is a critical difference from how incremental backups are handled by traditional backup software.</p>



<h4 class="wp-block-heading">2. Real-Time Redo Transport: Achieving Sub-Second RPO</h4>



<p>This feature is key to ZDLRA&#8217;s &#8220;zero data loss&#8221; claim, providing a zero to sub-second RPO. Redo data (records of all database changes) is streamed from the protected database&#8217;s memory buffers (LGWR) directly to the Recovery Appliance, typically asynchronously to minimize performance impact. This is similar to Data Guard redo transport. ZDLRA validates the redo and writes it to a staging area. Upon a log switch on the protected database, ZDLRA converts these redo changes into compressed archived redo log file backups and tracks them in its catalog. If the redo stream terminates unexpectedly (e.g., database crash), ZDLRA can create a partial archived redo log, protecting transactions up to the last change received. With real-time redo transport enabled, this obviates the need for separate archived log backups from the database host to ZDLRA.</p>



<p>Real-time redo transport effectively decouples redo protection from the source database&#8217;s archiving process, enabling more resilient and immediate capture of transactions. Traditional redo protection often relies on the database successfully writing to its online redo logs and then archiving them. ZDLRA&#8217;s real-time redo transport taps into the redo stream <em>before</em> or concurrently with local archiving, sending it directly from memory. Even if the primary database crashes before successfully archiving a log, ZDLRA can construct a partial archive log from the redo it has already received. This means ZDLRA acts as an independent, highly available redo log destination, guaranteeing transaction capture even if the source database&#8217;s own archiving mechanism is disrupted, which is critical for sub-second RPO.</p>



<h3 class="wp-block-heading">C. Architectural Integration: Data Flow from Protected Database to ZDLRA</h3>



<p>Protected databases use the RMAN client and the Recovery Appliance Backup Module (SBT library) to communicate with ZDLRA. For incremental backups, RMAN identifies changed blocks. These blocks are packaged and sent via the backup module over the network to an HTTP Server Application (Servlet) on ZDLRA. Real-time redo is transported similarly to Data Guard (typically via Oracle Net) to a Remote File Server (RFS) process on ZDLRA. ZDLRA then validates, processes (compression, indexing), and stores the incoming data/redo blocks in the Delta Store. Metadata is updated in the Recovery Appliance Catalog.</p>



<p>The data flow architecture for Delta Push is bifurcated (separate paths for incremental block backups and real-time redo) but converges within ZDLRA to provide a unified data protection state. Incremental data block backups are inherently batch-oriented, typically scheduled RMAN operations, even if frequent. They are processed via the SBT interface. Real-time redo transport is a continuous, stream-based process, capturing transactional changes as they occur using Data Guard-like mechanisms. Both data streams, changed blocks and redo records, arrive at ZDLRA and are processed into and cataloged by the Delta Store. This dual-path ingestion allows ZDLRA to capture both the state of data blocks at specific points in time (via incrementals) and the continuous flow of transactions (via redo), combining the strengths of snapshot-like backups and continuous data replication to enable recovery to almost any point in time.</p>



<h3 class="wp-block-heading">D. Formation of Virtual Full Backups via Delta Push</h3>



<p>While Delta Push is the <em>mechanism</em> for sending changes, its direct result is to enable ZDLRA&#8217;s Delta Store to create and maintain &#8220;Virtual Full Backups&#8221;. Each Delta Push (incremental backup) operation results in a new Virtual Full Backup becoming available in the ZDLRA catalog. This means changes are tracked not just from the last physical full backup, but from the previous Virtual Full backup.</p>



<p>Delta Push acts as the continuous feed that allows the Delta Store to maintain a constantly up-to-date, yet historically deep, set of recovery points represented as virtual full backups. Delta Push transmits the &#8220;deltas&#8221; – the changed blocks. The Delta Store receives these deltas and intelligently integrates them with previously stored block versions. This integration allows ZDLRA to construct a logically complete backup image for a specific point-in-time corresponding to an ingested incremental backup. Therefore, Delta Push is not just about efficient data transfer; it is the critical data pipeline that fuels the Delta Store&#8217;s ability to offer fast, point-in-time recovery through virtual full backups, effectively creating a &#8220;time machine&#8221; for the database.</p>



<h2 class="wp-block-heading">III. Delta Store: The Intelligent Repository for Protected Data</h2>



<h3 class="wp-block-heading">A. Delta Store Architectural Overview</h3>



<p>The Delta Store is &#8220;the totality of all protected database backup data in the Recovery Appliance storage location&#8221;. It resides on a dedicated ASM disk group (typically named DELTA) on ZDLRA. It is described as the &#8220;brains&#8221; of the Recovery Appliance, responsible for validating, compressing, indexing, and storing incoming backup data. It is not merely a passive storage area; it actively manages backup data to enable efficient virtual full backups and space optimization.</p>



<p>The Delta Store is an application-aware storage layer, deeply integrated with Oracle Database block structures and RMAN metadata, which distinguishes it from general-purpose deduplication appliances. General-purpose deduplication appliances typically operate at a generic block level without understanding the internal structure of database files. ZDLRA&#8217;s Delta Store, by contrast, captures copies of each Oracle block and organizes them hierarchically, and Oracle highlights its &#8220;Oracle context sensitivity&#8221;: it opens RMAN blocks to inspect their contents and index the data blocks for each data file. This database awareness allows for more intelligent deduplication (block versioning rather than just hash-based deduplication of RMAN backup pieces), validation, and the creation of consistent virtual full backups. This intelligence is what enables ZDLRA to perform block-correctness and RMAN recoverability validation directly on the appliance, offloading the production server.</p>



<h3 class="wp-block-heading">B. Internal Structure and Data Organization</h3>



<h4 class="wp-block-heading">1. Delta Pools: Granular Management of Data File Backups</h4>



<p>The Delta Store contains &#8220;delta pools&#8221; for all data files across all protected databases. A delta pool is the set of data file blocks from which the Recovery Appliance constructs virtual full backups. Each distinct data file whose backups are sent to ZDLRA has its own dedicated delta pool. For example, <code>datafile 10</code> from database <code>prod1</code> has its own delta pool.</p>



<p>The concept of a delta pool signifies a highly granular and organized approach to managing backup data, enabling efficient block versioning and retrieval at the individual data file level. Databases consist of multiple data files, each with its own lifecycle of changes. By maintaining a separate delta pool per data file, ZDLRA can track and version blocks specifically for that file. When an incremental backup for a data file arrives, ZDLRA updates the relevant delta pool with the new block versions. This level of granularity is essential for constructing a virtual full backup, as ZDLRA can quickly locate the correct versions of all blocks for each data file belonging to a specific point-in-time backup by querying these distinct pools. It also likely aids in space management and reclamation, as old blocks can be managed within the context of their specific data file pool.</p>



<h4 class="wp-block-heading">2. Block Versioning and Indexing Mechanisms</h4>



<p>The Delta Store is effectively a database of block versions. As incremental backups (Delta Pushes) arrive, the changed blocks are indexed into the Delta Store. The Recovery Appliance receives an incremental backup, validates it, compresses it, and writes it to a delta store. It indexes the backup so that corresponding virtual full backups become available. The ZDLRA metadata database, which includes the RMAN recovery catalog, manages the metadata about these blocks and their versions.</p>



<p>The indexing of individual database blocks within the Delta Store, not just backup pieces, is the core enabler of &#8220;virtual full backups&#8221; and efficient space utilization. Traditional backups store entire backup pieces (full or incremental). Restoration requires locating and processing these pieces. ZDLRA, in contrast, extracts individual data blocks from incoming incremental backups and indexes these blocks. The Delta Store maintains various versions of these blocks. A &#8220;virtual full backup&#8221; is essentially a metadata construct – a list of pointers to the correct versions of all blocks (from various delta pools) that constitute the database at a specific point in time. This block-level versioning and indexing mean that unchanged blocks are stored only once, and new &#8220;full&#8221; backups are created logically by updating pointers, rather than physically re-copying all data. This is the essence of space efficiency and rapid virtual full creation.</p>



<h3 class="wp-block-heading">C. Creation and Management of Virtual Full Backups within Delta Store</h3>



<p>The Delta Store uses the ingested incremental backups (via Delta Push) to create virtual full backups. A virtual full backup is a pointer-based representation of a physical full backup at the time of the incremental backup. It appears as a standard level 0 backup in the RMAN catalog. To create a virtual full backup, ZDLRA converts an incoming incremental level 1 backup into a virtual representation of an incremental level 0 backup. It combines the new changed blocks from the incremental backup with the previous unchanged blocks already present in the Delta Store. These virtual full backups are typically 10 times more space-efficient than physical full backups.</p>



<p>The creation of virtual full backups is an ongoing, dynamic process within the Delta Store, triggered by each successful Delta Push, ensuring that the latest recovery points are always full representations. As Oracle describes it, &#8220;each Delta Push sends the latest version of each changed block. Those changed blocks are indexed into the Delta Store and combined with previous un-changed blocks to form a Virtual Full Backup,&#8221; and &#8220;after the process [backup], the catalog reflects all the new virtual full backups that are available.&#8221; This implies a continuous synthesis. As new incremental data arrives, the Delta Store doesn&#8217;t just store the incremental; it actively processes it to update its pointers and metadata, making a new, comprehensive virtual full backup immediately available. This proactive synthesis is why ZDLRA can offer fast restores to recent points in time without the delay of manually applying many incrementals during the restore operation itself. The &#8220;merge&#8221; or &#8220;synthesis&#8221; happens upfront on ZDLRA.</p>
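<p>Because virtual fulls surface as ordinary level 0 backups, their availability can be confirmed from the RMAN client with a plain catalog query; no ZDLRA-specific syntax is involved.</p>



<pre class="wp-block-code"><code># Virtual full backups appear as regular incremental level 0 entries
LIST BACKUP OF DATABASE COMPLETED AFTER 'SYSDATE-1';</code></pre>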



<h3 class="wp-block-heading">D. Storage Optimization and Efficiency</h3>



<h4 class="wp-block-heading">1. Advanced Compression Techniques (including <code>RA_FORMAT</code>)</h4>



<p>ZDLRA employs specialized block-level compression algorithms. A newer client-side library feature, <code>RA_FORMAT=TRUE</code> (introduced around ZDLRA 23.1), allows for compression of the data <em>within</em> blocks before sending to ZDLRA. This is compatible with ZDLRA&#8217;s ability to create virtual full backups and validate stored backup sets. This client-side compression can compress the contents of TDE encrypted blocks as well as non-TDE blocks. If RMAN encryption is also on, non-TDE blocks are compressed then encrypted. This compression reduces network bandwidth for backups and replication, and storage space on ZDLRA. Archive log compression (BASIC, LOW, MEDIUM, HIGH) can also be configured; LOW, MEDIUM, and HIGH do not require ACO on the protected database when ZDLRA is used.</p>



<p>The <code>RA_FORMAT</code> feature represents a significant evolution in ZDLRA&#8217;s compression strategy, moving some intelligence to the client to optimize data <em>before</em> transmission and storage, and enabling effective compression even for TDE-encrypted data. Previously, RMAN compression would compress the entire backup set. If the data was TDE encrypted, this compressed backup set was unreadable by ZDLRA for its block-level operations. <code>RA_FORMAT=TRUE</code> compresses the <em>contents</em> of each block, leaving the block headers intact for ZDLRA to read. This allows ZDLRA to perform its virtual full backup creation and validation even on backups originating from TDE tablespaces, because the data within the blocks is compressed, but the block structure ZDLRA needs is preserved. This overcomes a major challenge in backup efficiency for encrypted databases, offering both security (TDE) and storage/network efficiency (compression), which were often mutually exclusive or suboptimal with older methods.</p>
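<p>As a sketch, and assuming the flag is passed through the backup module&#8217;s <code>ENV</code> settings as in recent releases (verify against the documentation for your ZDLRA version), enabling the new format is a small change to the channel configuration; paths and aliases remain placeholders.</p>



<pre class="wp-block-code"><code># Same channel as before, with client-side RA_FORMAT compression enabled
CONFIGURE CHANNEL DEVICE TYPE 'SBT_TAPE' PARMS "SBT_LIBRARY=/u01/app/oracle/lib/libra.so, ENV=(RA_WALLET='location=file:/u01/app/oracle/wallet credential_alias=zdlra-scan:1521/zdlra:dedicated', RA_FORMAT=TRUE)";</code></pre>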



<h4 class="wp-block-heading">2. Automated Space Management and Delta Pool Optimization</h4>



<p>The Recovery Appliance performs automated delta pool space management. This includes deleting old or expired backups (on disk and on tape/cloud) based on recovery window goals and retention policies. ZDLRA periodically reorganizes delta pools to improve restore performance by maintaining contiguity of blocks (delta pool optimization) as old blocks are deleted and new ones arrive.</p>



<p>Automated space management and delta pool optimization are critical for sustaining the long-term performance and efficiency of the &#8220;incremental forever&#8221; strategy. An unmanaged &#8220;incremental forever&#8221; system could lead to highly fragmented storage over time as myriad small changes accumulate and old data becomes obsolete. The deletion of old blocks reclaims space, vital for cost-effectiveness. The reorganization of delta pools addresses the potential performance degradation from fragmentation that could arise from frequent updates and deletions, ensuring restore operations remain fast by optimizing read access. These automated background tasks are therefore essential for the sustainability of the ZDLRA model, preventing it from becoming unwieldy or slow over long periods of operation.</p>



<p>The following table summarizes the internal components of the Delta Store and their roles:</p>



<p><strong>Table 2: Delta Store Internal Components and Roles</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Component</th><th>Description/Structure</th><th>Primary Function within Delta Store</th><th>Contribution to ZDLRA Efficiency/Recovery</th></tr><tr><td>Delta Pool</td><td>A logical unit for each data file, containing all backed-up block versions for that specific data file.</td><td>Organizing and managing blocks belonging to a specific data file.</td><td>Granular block management, efficient versioning, and rapid construction of virtual full backups.</td></tr><tr><td>Block Version</td><td>A copy of a data block at a specific point in time.</td><td>Tracking data changes over time.</td><td>Space efficiency (only changed blocks stored), ability to restore to any point in time.</td></tr><tr><td>Index</td><td>Metadata structure tracking the locations and versions of blocks within the Delta Store.</td><td>Enabling rapid location of correct block versions when constructing virtual full backups.</td><td>Fast virtual full backup creation, efficient restore operations.</td></tr><tr><td>Virtual Full Backup Metadata</td><td>Set of pointers to the block versions that constitute a full backup at a specific point in time.</td><td>Providing a logical representation of a physical full backup.</td><td>Storage efficiency (pointers instead of physical full backups), appears as a standard level 0 backup to RMAN, fast recovery.</td></tr></tbody></table></figure>



<h2 class="wp-block-heading">IV. Synergistic Architecture: Operation of Delta Push and Delta Store within ZDLRA</h2>



<h3 class="wp-block-heading">A. End-to-End Data Protection Workflow: From Transaction to Recoverable Backup</h3>



<p>The data protection process begins when a transaction occurs on the protected database. These changes are transmitted to ZDLRA almost instantaneously as part of the Delta Push mechanism.</p>



<ol class="wp-block-list">
<li><strong>Transaction Occurs:</strong> Changes are made in the protected database.</li>



<li><strong>Real-Time Redo Push:</strong> LGWR (or an asynchronous process) sends redo data from memory buffers to ZDLRA. ZDLRA stages and validates this redo.  </li>



<li><strong>Incremental Backup (Delta Push):</strong> Periodically, RMAN performs an incremental level 1 backup. Changed blocks are identified via block change tracking.</li>



<li><strong>Data Transfer:</strong> The Recovery Appliance Backup Module sends these changed blocks to ZDLRA.  </li>



<li><strong>ZDLRA Ingestion and Processing (Delta Store):</strong>
<ul class="wp-block-list">
<li>Incoming incremental blocks are validated, compressed (if <code>RA_FORMAT=TRUE</code> or with ZDLRA-side compression), and indexed into their respective delta pools within the Delta Store.  </li>



<li>The Delta Store synthesizes a new virtual full backup using these new blocks and existing unchanged blocks.  </li>



<li>Redo logs are converted by ZDLRA into archived log backups upon log switch.  </li>
</ul>
</li>



<li><strong>Catalog Update:</strong> ZDLRA&#8217;s internal RMAN catalog is updated to reflect the new virtual full backup and archived redo logs, making them available for recovery.  </li>



<li><strong>Continuous Validation:</strong> ZDLRA continuously validates backups for recoverability.  </li>



<li><strong>Lifecycle Management:</strong> Policies for retention, replication to another ZDLRA, or archival to tape/cloud are applied.  </li>
</ol>



<p>The synergy between Delta Push (ingestion) and Delta Store (processing and storage engine) creates a closed-loop system for continuous data protection and recovery readiness. Delta Push continuously feeds changed data (blocks and redo) to ZDLRA. The Delta Store immediately processes this data, integrates it into its versioned block repository, and creates virtual full backups. The updated catalog then makes these new recovery points instantly available. This tight, automated loop ensures ZDLRA is always as up-to-date as possible with the state of protected databases, minimizing data loss risk and guaranteeing that recovery assets are constantly refreshed and validated.</p>



<h3 class="wp-block-heading">B. Role of Key ZDLRA Components</h3>



<h4 class="wp-block-heading">1. Recovery Appliance Backup Module (libra.so / SBT Library)</h4>



<p>This Oracle-supplied SBT library is installed on protected database hosts and is used by RMAN to transfer backup data to ZDLRA. It manages communication for backup and restore operations between RMAN and ZDLRA. With newer versions (e.g., ZDLRA 23.1), this library can perform client-side compression and formatting (<code>RA_FORMAT=TRUE</code>).</p>



<p>The Recovery Appliance Backup Module is more than a simple data pipe; it&#8217;s an intelligent client-side agent that actively participates in optimizing the backup stream. Traditionally, SBT libraries are primarily interfaces for RMAN to write to third-party media managers. The ZDLRA backup module, especially with features like <code>RA_FORMAT</code>, performs pre-processing (compression, ZDLRA-specific formatting) on the client-side. This client-side intelligence reduces load on ZDLRA for certain tasks, optimizes network traffic, and enables advanced features like effectively compressing TDE data before it even reaches the appliance. It acts as an essential, integrated part of the ZDLRA solution, not just a generic connector.</p>



<h4 class="wp-block-heading">2. Recovery Appliance Metadata Database and Catalog</h4>



<p>Residing on each Recovery Appliance, it manages metadata for all backups and contains the RMAN recovery catalog for all protected databases. This catalog is mandatory and is automatically updated by ZDLRA as backups are processed. It stores information about backup pieces, archived logs, virtual full backups, delta pools, and block versions, which is essential for orchestrating restores and managing space. ZDLRA uses two main disk groups: DELTA for backups and CATALOG for RMAN catalog tables.</p>



<p>ZDLRA&#8217;s centralized and self-managing RMAN catalog serves as the &#8220;single source of truth&#8221; for all protected databases, enabling simplified management and consistent recovery across the enterprise. In traditional environments, RMAN catalogs might be separate or controlfile-based, leading to management complexity for many databases. ZDLRA mandates and manages a central catalog within its own embedded RAC database. This catalog automatically reflects all virtual full backups and other recovery assets created by ZDLRA. Database administrators (DBAs) interact with this catalog via standard RMAN commands for restores, without needing to know the internal complexities of virtual backups or delta pools. This centralization and automation significantly simplify backup administration, especially in large environments.</p>
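<p>From the DBA&#8217;s point of view, the interaction is plain RMAN against a catalog service on the appliance. The user and service names below are illustrative placeholders for a virtual private catalog account.</p>



<pre class="wp-block-code"><code>$ rman TARGET / CATALOG vpc_prod1@zdlra_cat

# First contact only: enroll the protected database in the RA catalog
REGISTER DATABASE;

# Restores are then ordinary RMAN commands; ZDLRA serves virtual fulls
RESTORE DATABASE;
RECOVER DATABASE;</code></pre>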



<h3 class="wp-block-heading">C. Control Flow and Policy Enforcement</h3>



<p>Protection policies are defined on ZDLRA to manage recovery window goals, data retention periods on disk and tape/cloud, replication, and other backup lifecycle aspects. These policies are applied to protected databases. ZDLRA&#8217;s automated space management tasks (deletion of old backups, delta pool optimization) are driven by these policies. Enterprise Manager Cloud Control is typically used to manage and monitor ZDLRA and its policies.</p>



<p>ZDLRA&#8217;s policy-based management automates much of the backup lifecycle, abstracting complexity and ensuring adherence to defined service levels. Manual management of backup retention, replication, and tiering for hundreds of databases is error-prone and labor-intensive. ZDLRA allows administrators to define high-level protection policies (e.g., Gold, Silver, Bronze with different RPOs/retention). The appliance then automatically enforces these policies, managing space, creating virtual full backups, replicating data, and archiving to secondary storage. This automation ensures consistency, reduces administrative burden, and helps organizations meet their data protection SLAs reliably.</p>
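<p>Policies themselves are created on the appliance through the <code>DBMS_RA</code> PL/SQL package. The following is a heavily hedged sketch: the parameter names and values shown are illustrative and should be verified against the documentation for the installed Recovery Appliance release.</p>



<pre class="wp-block-code"><code>-- Run on the Recovery Appliance as an RA administrator (names illustrative)
BEGIN
  DBMS_RA.CREATE_PROTECTION_POLICY(
    protection_policy_name => 'GOLD',
    description            => 'Tier-1 databases, 35-day recovery window',
    storage_location_name  => 'DELTA',
    recovery_window_goal   => INTERVAL '35' DAY);

  -- Associate a protected database with the policy
  DBMS_RA.ADD_DB(
    db_unique_name         => 'PROD1',
    protection_policy_name => 'GOLD',
    reserved_space         => '10T');
END;
/</code></pre>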



<p>The following table illustrates the interactive workflow between Delta Push and Delta Store step-by-step:</p>



<p><strong>Table 3: Delta Push and Delta Store Interaction Workflow</strong></p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Step No.</th><th>Action/Process</th><th>Responsible Component(s)</th><th>Key Outcome of the Step</th></tr><tr><td>1</td><td>Database Change</td><td>Protected Database</td><td>Data is modified.</td></tr><tr><td>2</td><td>Redo Sent</td><td>Protected Database (LGWR), ZDLRA (Delta Push Receiver)</td><td>Real-time redo data is transferred to and staged on ZDLRA.</td></tr><tr><td>3</td><td>Incremental Backup Initiated</td><td>Protected Database (RMAN)</td><td>Periodic incremental backup process is triggered.</td></tr><tr><td>4</td><td>Blocks Sent to ZDLRA</td><td>RA Backup Module, ZDLRA (Delta Push Receiver)</td><td>Changed data blocks are transmitted to ZDLRA.</td></tr><tr><td>5</td><td>ZDLRA Validates and Compresses</td><td>ZDLRA (Delta Store)</td><td>Incoming blocks are validated and compressed.</td></tr><tr><td>6</td><td>Blocks Indexed in Delta Pool</td><td>ZDLRA (Delta Store)</td><td>Changed blocks are added to the relevant delta pool and indexed.</td></tr><tr><td>7</td><td>Virtual Full Backup Created</td><td>ZDLRA (Delta Store)</td><td>A new virtual full backup is synthesized using new and existing blocks.</td></tr><tr><td>8</td><td>Catalog Updated</td><td>ZDLRA (Catalog)</td><td>New virtual full backup and archived logs become available in the RMAN catalog for recovery.</td></tr></tbody></table></figure>



<h2 class="wp-block-heading">V. Conclusion: Technical Significance of Delta Push and Delta Store in ZDLRA</h2>



<h3 class="wp-block-heading">A. Summary of Core Architectural and Operational Principles</h3>



<p>Delta Push is an efficient, dual-pronged mechanism (RMAN incrementals + real-time redo) for transferring only necessary changes from protected Oracle databases to ZDLRA. Delta Store is the intelligent, Oracle-aware repository that ingests these changes, versions data blocks at a granular level within delta pools, and synthesizes space-efficient virtual full backups. This &#8220;incremental forever&#8221; approach, powered by Delta Push and Delta Store, minimizes production impact, dramatically reduces RPO, and simplifies recovery.</p>



<h3 class="wp-block-heading">B. Impact on Data Protection, Recovery Speed, and Efficiency</h3>



<p>The combined effect of Delta Push and Delta Store fundamentally redefines Oracle database backup and recovery from a periodic, resource-intensive chore to a continuous, low-impact, and highly reliable data assurance service.</p>



<ul class="wp-block-list">
<li><strong>Data Protection:</strong> Near-zero data loss (sub-second RPO) is achieved thanks to Delta Push&#8217;s real-time redo transport component. Enhanced resilience against ransomware is offered through immutable backups and rapid recovery capabilities. Continuous validation guarantees backup integrity.  </li>



<li><strong>Recovery Speed:</strong> Fast restores from virtual full backups are possible without the need to apply numerous incrementals on the production server. The &#8220;time machine&#8221; feature enables rapid rollback. The Dialog Semiconductor case study showed approximately 4x faster restores.  </li>



<li><strong>Efficiency:</strong> Significant reductions in backup windows, production server load (CPU, I/O), network traffic, and storage consumption are achieved through the incremental forever strategy, Delta Push, virtual full backups, and advanced compression. Backup operations are offloaded from database servers.  </li>
</ul>



<p>Traditional backups are disruptive events. Delta Push makes data ingestion minimally impactful, while Delta Store optimizes backup data for space and keeps it immediately available in a &#8220;full&#8221; format for rapid recovery. Automation and continuous validation add further layers of reliability. Together they transform data protection from a necessary evil into an integrated, efficient, and highly effective component of Oracle database operations, as evidenced by results like those at Dialog Semiconductor and reported overall TCO reductions.</p>



<p>Furthermore, the Delta Push and Delta Store architecture offers a robust defense against modern cyber threats like ransomware, not just by enabling fast recovery but by ensuring the integrity and availability of recovery points up to the last moments before an attack. Ransomware attacks aim to encrypt data and backups, making recovery difficult or impossible. ZDLRA&#8217;s real-time redo capture via Delta Push allows recovery to within seconds of an attack, Delta Store&#8217;s continuous validation helps detect corruption early, and backup immutability protects the backups themselves. The ability to rapidly restore a clean virtual full backup to a secure location means organizations can avoid paying ransoms. The technical design thus translates directly into enhanced cyber resilience, a critical requirement in today&#8217;s threat landscape.</p>



<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/analysis-of-delta-push-and-delta-store-mechanisms-within-zdlra.html">Analysis of Delta Push and Delta Store Mechanisms within ZDLRA</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Oracle 23ai Tablespace Shrink</title>
		<link>https://www.bugraparlayan.com.tr/oracle-23ai-tablespace-shrink.html</link>
		
		<dc:creator><![CDATA[Bugra Parlayan]]></dc:creator>
		<pubDate>Sun, 11 May 2025 16:57:59 +0000</pubDate>
				<category><![CDATA[Oracle Database]]></category>
		<category><![CDATA[ASM Storage]]></category>
		<category><![CDATA[AutoShrink]]></category>
		<category><![CDATA[Datafile Resize]]></category>
		<category><![CDATA[High Water Mark]]></category>
		<category><![CDATA[Oracle 23ai]]></category>
		<category><![CDATA[Segment Space Management]]></category>
		<category><![CDATA[Space Reclamation]]></category>
		<category><![CDATA[Storage Optimization]]></category>
		<category><![CDATA[Tablespace Shrinking]]></category>
		<category><![CDATA[Undo Tablespace]]></category>
		<guid isPermaLink="false">https://www.bugraparlayan.com.tr/?p=1469</guid>

					<description><![CDATA[<p>1. Introduction to the Oracle 23ai Tablespace Shrink Feature The Oracle 23ai database version introduces a significant innovation in storage management, addressing one of the most common and critical challenges faced by Database Administrators (DBAs): the Tablespace Shrink feature. In today&#8217;s digital age, data volumes are growing exponentially, leading to increased storage costs and greater &#8230;</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/oracle-23ai-tablespace-shrink.html">Oracle 23ai Tablespace Shrink</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">1. Introduction to the Oracle 23ai Tablespace Shrink Feature</h2>



<p>The Oracle 23ai release introduces a significant storage management innovation, the Tablespace Shrink feature, which addresses one of the most common and critical challenges faced by Database Administrators (DBAs). In today&#8217;s digital age, data volumes are growing exponentially, driving up storage costs and the complexity of managing storage infrastructures. The feature offers the potential to both optimize storage costs and improve database performance by effectively reclaiming unused disk space that accumulates within tablespaces over time. Oracle&#8217;s overall strategy is geared towards simplifying and automating database management tasks; Tablespace Shrink aligns with this vision, enabling DBAs to work more efficiently.</p>



<p>The development of such intelligent storage management features is closely linked to Oracle&#8217;s &#8220;Autonomous Database&#8221; vision and the flexible storage demands of cloud computing. In cloud environments, storage resources are often billed on a &#8220;pay-as-you-go&#8221; model, where every unused byte translates into unnecessary cost. Manual space management can be both time-consuming and error-prone; automated or semi-automated space reclamation mechanisms like Tablespace Shrink have the potential to significantly reduce these costs and the management burden. This aligns with Oracle&#8217;s cloud strategy and its goal of evolving databases into self-sufficient, resource-efficient &#8220;Autonomous&#8221; systems. Furthermore, the feature holds significant value not only for large-scale enterprise systems but also for environments with more limited resources, such as Oracle Database Free. The fact that the SYSAUX tablespace, which can grow over time due to components like AWR data and scheduled task logs, can also be shrunk gives users operating within the 12 GB user data limit of Oracle Database Free the ability to allocate more space to application data. This can be seen as a reflection of Oracle&#8217;s strategy to support the developer community and smaller-scale users.</p>



<h2 class="wp-block-heading">2. What is Tablespace Shrink and Why is it Important for Database Optimization?</h2>



<p>At its core, the Tablespace Shrink operation reorganizes the data segments (e.g., tables, indexes) within an Oracle tablespace so that unused free space is consolidated at the end of the data files, and then reduces the physical size of those files so the freed disk space is returned to the operating system. Databases are dynamic structures; continuous insert, delete, and update activity leaves unused gaps, or &#8220;fragmentation,&#8221; within tablespaces. Tablespace Shrink aims to make this idle space reusable.</p>



<p>The importance of this feature can be summarized in several key points:</p>



<ul class="wp-block-list">
<li><strong>Space Reclamation and Cost Optimization:</strong> Its most apparent benefit is the recovery of unused (&#8220;empty&#8221;) disk space left behind after objects are deleted or moved elsewhere in the database. This translates directly into cost savings, especially in enterprise environments where storage costs are high or in cloud-based systems where capacity planning is critical. Studies have shown that tablespace shrinking operations reduce total storage costs and postpone the need for additional disk purchases. Oracle 23ai&#8217;s feature &#8220;allows you to reclaim unused space in the database, reducing costs and optimizing storage&#8221;.  </li>



<li><strong>Performance Improvement:</strong> Reducing data fragmentation can lead to noticeable increases in query performance. When data is stored in more organized and contiguous blocks, operations like full table scans or wide-ranging index range scans can be completed with fewer I/O operations. This shortens disk read times and allows for more efficient use of the database buffer cache. One case study reported that a 30% reduction in tablespace size led to a 25% improvement in query performance.  </li>



<li><strong>Reduced Backup and Restore Times:</strong> As the size of data files decreases, the time required to back them up and restore them in a disaster recovery scenario also decreases. This is a critical advantage, especially for large databases with tight backup windows and stringent recovery time objectives (RTO).  </li>



<li><strong>Simplified Management:</strong> A more organized, less fragmented, and optimized tablespace structure simplifies overall database administration.</li>
</ul>



<p>Before Oracle 23ai, shrinking tablespaces was often a cumbersome and complex process. DBAs frequently encountered the <code>ORA-03297: file contains used data beyond requested RESIZE value</code> error when attempting to shrink a data file. This error occurred because in-use data blocks still existed somewhere in the data file beyond the desired shrink boundary. To work around it, DBAs used methods such as <code>ALTER TABLE... MOVE</code>, online table redefinition with the <code>DBMS_REDEFINITION</code> package, recreating the table with <code>CREATE TABLE AS SELECT</code> (CTAS), or exporting and re-importing the table. These older methods were generally complex, time-consuming, required significant additional temporary storage, and often meant downtime for the affected objects. The Tablespace Shrink feature significantly mitigates these challenges by offering a simpler, integrated solution.</p>
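


<p>As a point of contrast, the sketch below shows the kind of manual routine this replaces; the schema, object, and file names are illustrative.</p>



<pre class="wp-block-code"><code>-- Pre-23ai manual approach: relocate segments away from the file tail,
-- rebuild the dependent indexes, then attempt the resize.
ALTER TABLE app_owner.orders MOVE ONLINE;
ALTER INDEX app_owner.orders_pk REBUILD ONLINE;

-- The resize succeeds only if no used blocks remain beyond the new boundary;
-- otherwise it fails with ORA-03297.
ALTER DATABASE DATAFILE '/u01/oradata/ORCL/users01.dbf' RESIZE 500M;</code></pre>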



<p>Tablespace Shrink can be viewed not just as a reactive tool (i.e., reclaiming free space after a large data deletion) but also as part of a proactive storage management strategy. Databases are inherently dynamic; continuous data insertion, deletion, and update operations lead to fragmentation and the accumulation of unused space over time. Therefore, by periodically assessing the state of tablespaces using the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure in analyze mode (<code>TS_SHRINK_MODE_ANALYZE</code>) and applying the shrink operation (<code>TS_SHRINK_MODE_ONLINE</code> or another appropriate mode) when necessary, not just after major cleanup operations, storage health and efficiency can be consistently maintained at a high level. This proactive approach can help prevent sudden storage issues, performance degradation, and unnecessary storage costs. Performance improvements are not limited to end-user queries but can also provide indirect benefits in Oracle&#8217;s internal background processes, such as accessing data file headers and extent management. Compacting segments and reducing disorganization can mean less physical and logical I/O, not only for full table scans but also for index accesses and even Data Manipulation Language (DML) operations.</p>



<h2 class="wp-block-heading">3. Enhancements in Oracle 23ai Version for Tablespace Management</h2>



<p>The Tablespace Shrink feature is one of the significant innovations introduced with the Oracle Database 23ai release. With this release, the long-awaited capability for DBAs to safely and effectively reclaim unused space in tablespaces has been integrated directly into the database core. Initially, the feature was designed primarily for Bigfile tablespaces. Bigfile tablespaces consist of a single large data file (supporting up to 4G blocks) and are intended to simplify the management of very large databases compared to traditional Smallfile tablespaces (which can contain multiple, smaller data files). The default creation of essential system tablespaces like SYSAUX, SYSTEM, and USERS as Bigfile tablespaces in Oracle Database 23ai also points in this direction.</p>



<p>The scope of the feature was significantly expanded with a subsequent update, Oracle Database 23ai Release Update 23.7 (and later), which extended Tablespace Shrink functionality to Smallfile tablespaces as well. This enhancement is of great importance, especially for organizations with legacy systems that use multiple small data files, or for environments that prefer the Smallfile tablespace structure due to specific application requirements. Although the default tablespace type in Oracle 23ai is Bigfile, the option to create Smallfile tablespaces still exists, and this development makes storage management for Smallfile tablespaces more flexible and efficient.</p>
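


<p>Which category an existing tablespace falls into can be checked from the data dictionary, as in the simple query below (the <code>BIGFILE</code> column of <code>DBA_TABLESPACES</code> returns YES or NO).</p>



<pre class="wp-block-code"><code>-- Identify Bigfile vs. Smallfile tablespaces; Smallfile shrink support
-- requires Release Update 23.7 or later.
SELECT tablespace_name, bigfile, status
FROM   dba_tablespaces
ORDER  BY tablespace_name;</code></pre>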



<p>Oracle&#8217;s strategy of first introducing a feature for more modern and relatively simpler structures like Bigfile tablespaces, and then extending support to more traditional and potentially complex structures like Smallfile tablespaces, reflects a phased rollout and maturation approach. Bigfile tablespaces, consisting of a single large data file, present less complexity for implementing and testing a shrink mechanism than Smallfile tablespaces, which involve the interaction of multiple data files. By first introducing a new and significant feature in this more controlled environment, Oracle aims to ensure stability and make improvements based on user feedback. Smallfile tablespaces, on the other hand, can contain multiple data files, and how these are managed during a shrink operation (e.g., consolidating data into fewer files, or shrinking each file individually) can introduce additional operational complexity. Introducing Smallfile support in a Release Update (RU) after the feature&#8217;s initial debut has therefore allowed the feature to mature further and be adopted confidently by a broader user base.</p>



<p>The delivery of Smallfile support via a Release Update like 23.7 also reflects Oracle&#8217;s &#8220;continuous innovation&#8221; model. Oracle delivers both bug fixes and new features and enhancements through RUs released between major database versions. The fact that Smallfile support for a significant feature like Tablespace Shrink came with an RU once again underscores how important it is for users to keep their systems at current RU levels to benefit from such developments. It also highlights the need for DBAs to regularly follow Oracle&#8217;s feature announcements, release notes, and support documentation.</p>



<h2 class="wp-block-heading">4. Advantages of Oracle 23ai Tablespace Shrink</h2>



<p>The Tablespace Shrink feature in Oracle 23ai offers several tangible advantages to database administrators and, consequently, to organizations. These benefits range from storage efficiency to performance gains:</p>



<ul class="wp-block-list">
<li><strong>Optimized Storage Utilization and Cost Reduction:</strong> The most fundamental and direct benefit is the increased disk space efficiency achieved by reclaiming &#8220;empty&#8221; space that accumulates in tablespaces over time and is not actively used. This idle space, resulting from deleted tables, updated data, or reorganized indexes, can be effectively returned to the operating system using the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure. This translates directly into financial savings, especially in large data environments where storage costs are a significant expense, or on cloud platforms where a pay-as-you-go model is prevalent. Many sources state that this feature &#8220;helps both reduce costs and optimize database storage&#8221;  and &#8220;lowers total storage costs&#8221;. Oracle&#8217;s own promotions also state, &#8220;This feature allows you to reclaim unused space in the database, reducing costs and optimizing storage&#8221;.  </li>



<li><strong>Enhanced Query Performance:</strong> The Tablespace Shrink operation reduces data fragmentation by reorganizing and compacting segments. When logically related data blocks are also physically located in closer proximity on the disk, Oracle&#8217;s data access becomes faster. This leads to shorter completion times for I/O-intensive operations, particularly full table scans and wide-ranging index range scans. Fewer and more efficient I/O operations result in improved overall query response times. A case study reported that a 30% reduction in tablespace size led to a 25% improvement in query performance. Considering that fragmentation &#8220;increases data retrieval times and slows query performance&#8221;, this improvement is quite significant.</li>



<li><strong>Faster Backup and Recovery Operations:</strong> When the total size of data files is reduced, the time required to back up these files (e.g., with RMAN) and restore them in a disaster scenario is also significantly shortened. Less data to read, process, and write contributes to more effective use of backup windows and improves critical business continuity metrics like Recovery Time Objectives (RTO) and Recovery Point Objectives (RPO). One study noted that the shrink operation &#8220;reduced backup times by 20%&#8221;.  </li>



<li><strong>Simplified Storage Management:</strong> Especially in Smallfile tablespaces composed of numerous small data files, the shrink operation (if it can consolidate data files) can create a structure with fewer, more organized data files. This simplifies overall file system management and tablespace maintenance tasks. While Bigfile tablespaces already have a single data file, making this advantage more pertinent to Smallfile environments, improving the internal organization of a Bigfile also indirectly contributes to manageability.  </li>
</ul>



<p>These advantages are often interconnected and create a combined effect. For instance, reduced fragmentation (leading to performance gains) and smaller data file sizes (providing space savings) directly contribute to shorter backup times. The shrink operation consolidates free space by moving objects towards the beginning of the data file. This ensures that data that should be logically contiguous is also physically closer, reducing fragmentation. Reduced fragmentation allows queries to access data faster because fewer I/O operations and extent reads are required. Simultaneously, the total size of the data file decreases, and smaller data files mean that backup tools like RMAN need to read and write less data, which shortens backup times.</p>



<p>The concept of &#8220;cost optimization&#8221; is not limited to direct disk space costs. It can also mean less administrative effort (and thus lower human resource costs) and potentially lower licensing costs (where cloud services or third-party tools have licensing models based on storage size). While manual shrink operations and complex scripts require time and deep expertise, the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure greatly simplifies this process, saving DBAs valuable time.</p>



<h2 class="wp-block-heading">5. Working Mechanism of Tablespace Shrink in Oracle 23ai</h2>



<p>The fundamental working principle of the Tablespace Shrink feature in Oracle 23ai is based on two main steps: first, reorganizing segments (tables, indexes, etc.) within the tablespace to consolidate free space; and second, physically shrinking the data file(s) at the end where this consolidated free space has accumulated, thereby returning disk space to the operating system.</p>



<p><strong>Object Movement (Online/Offline) and Its Core Logic:</strong> The first and most critical step of the procedure is to efficiently move the various segments (tables, indexes, LOB segments, etc.) within the tablespace towards the beginning of the data file or files. This movement gathers the unused free space, previously scattered among data blocks and causing &#8220;fragmentation,&#8221; at the end of the data file(s), creating a large, contiguous block of free space that can then be shrunk.</p>



<p>This object movement is performed via Oracle&#8217;s Data Definition Language (DDL) commands, either online or offline, depending on the <code>shrink_mode</code> parameter passed to the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure.</p>



<ul class="wp-block-list">
<li><strong>Online Move:</strong> When <code>TS_SHRINK_MODE_ONLINE</code> or <code>TS_SHRINK_MODE_AUTO</code> (in case of a successful online attempt) modes are selected, Oracle attempts to move objects while allowing DML (INSERT, UPDATE, DELETE) operations and queries on the respective objects. This is important for maintaining application availability. However, online operations generally consume more system resources (CPU, I/O, redo log) and can be slower than offline operations due to certain locking mechanisms.  </li>



<li><strong>Offline Move:</strong> When <code>TS_SHRINK_MODE_OFFLINE</code> or <code>TS_SHRINK_MODE_AUTO</code> (if the online attempt fails) modes are selected, objects are moved offline. This means access to the respective objects will be blocked during the move operation. However, offline movement is generally faster and can, in some cases, achieve more effective compaction.  </li>
</ul>



<p>It is noted that for Smallfile tablespaces, these move operations trigger well-known DDL commands like <code>ALTER TABLE &lt;table_name&gt; MOVE ONLINE</code> and <code>ALTER INDEX &lt;index_name&gt; REBUILD ONLINE</code> in the background. This indicates that the Tablespace Shrink feature doesn&#8217;t work &#8220;magically&#8221; but rather intelligently orchestrates Oracle&#8217;s existing and proven DDL capabilities.</p>



<p><strong>Data File Resizing Process:</strong> After objects are successfully moved and unused free space is consolidated at the end of the data file(s), the second main step of the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure comes into play: reducing the physical size of the data file(s). This operation works with a logic similar to the <code>ALTER DATABASE DATAFILE &lt;datafile_name&gt; RESIZE &lt;new_size&gt;</code> command. Unlike using this command directly, however, the Tablespace Shrink feature performs the object movement step beforehand, thus preventing the <code>ORA-03297</code> error (file contains used data beyond requested RESIZE value). A successful object movement &#8220;cleans&#8221; the end of the data file, allowing the resize to be performed safely and the reclaimed free space to be physically returned to the operating system.</p>



<p>The use of online DDL demonstrates Oracle&#8217;s emphasis on uninterrupted operations and high availability. However, &#8220;online&#8221; does not always mean &#8220;zero impact.&#8221; While online DDL operations allow DML to continue, they perform intensive background tasks such as copying data blocks, logging, and metadata updates. These activities create additional CPU, I/O, and redo log load on the system. Especially on heavily loaded systems, this extra load can temporarily affect overall database performance. Therefore, even in online mode, it is generally recommended to perform Tablespace Shrink operations during periods of lower system activity or within planned maintenance windows.</p>



<p>In Smallfile tablespaces with multiple data files, the shrink operation may attempt to consolidate these files and reduce their number. This aims not only to reclaim disk space but also to simplify file management. However, some observations suggest that &#8220;unlike bigfile tablespace shrink, smallfile tablespace shrink doesn&#8217;t appear to release much free space.&#8221; This might indicate that in Smallfile tablespaces the priority is sometimes to reduce the data files to a minimum number and then shrink the remaining files, rather than raw space reclamation. Especially in older systems with many small data files, this consolidation step can be considered a managerial gain, even if there isn&#8217;t a large reduction in the total number of used blocks.</p>



<h2 class="wp-block-heading">6. Deep Dive into <code>DBMS_SPACE.SHRINK_TABLESPACE</code> Procedure</h2>



<p>The central tool for performing tablespace shrink operations in Oracle 23ai is the <code>SHRINK_TABLESPACE</code> PL/SQL procedure in the <code>DBMS_SPACE</code> package. The procedure supports both Bigfile tablespaces and, from Oracle 23ai Release Update 23.7 onwards, Smallfile tablespaces. It is used for two main purposes: first, to analyze a tablespace before actually shrinking it, reporting potential space savings and movable objects (<code>ANALYZE</code> mode); and second, to perform the shrink operation itself (<code>SHRINK</code> modes).</p>



<p><strong>Parameters and Their Meanings:</strong> The <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure accepts various parameters to control its behavior. A correct understanding of these parameters is crucial for the effective use of the feature.</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Parameter Name</th><th>Data Type</th><th>IN/OUT</th><th>Default Value</th><th>Description</th></tr><tr><td><code>ts_name</code></td><td>VARCHAR2</td><td>IN</td><td>None</td><td>The name of the tablespace to be shrunk or analyzed. <sup></sup></td></tr><tr><td><code>shrink_mode</code></td><td>NUMBER</td><td>IN</td><td><code>DBMS_SPACE.TS_SHRINK_MODE_ONLINE</code></td><td>Determines the operating mode of the procedure. One of the following constants is used.</td></tr><tr><td></td><td></td><td></td><td><code>DBMS_SPACE.TS_SHRINK_MODE_ANALYZE</code></td><td>Analyzes the tablespace, reports potential savings, but performs no shrink. (Formerly <code>DBMS_SPACE.TS_MODE_ANALYZE</code>) <sup></sup></td></tr><tr><td></td><td></td><td></td><td><code>DBMS_SPACE.TS_SHRINK_MODE_ONLINE</code></td><td>Default mode. Attempts to move objects online (except Index-Organized Tables, which are moved offline). Aims for minimal downtime. (Formerly <code>DBMS_SPACE.TS_MODE_SHRINK</code>) <sup></sup></td></tr><tr><td></td><td></td><td></td><td><code>DBMS_SPACE.TS_SHRINK_MODE_AUTO</code></td><td>First attempts an online move; if it fails, switches to an offline move. (Formerly <code>DBMS_SPACE.TS_MODE_SHRINK_FORCE</code>) <sup></sup></td></tr><tr><td></td><td></td><td></td><td><code>DBMS_SPACE.TS_SHRINK_MODE_OFFLINE</code></td><td>Moves objects offline. Offers the best shrink result and performance in terms of processing time, but objects become inaccessible during the operation. (Distinct naming for 23.7 and later) <sup></sup></td></tr><tr><td><code>target_size</code></td><td>NUMBER</td><td>IN</td><td><code>DBMS_SPACE.TS_TARGET_MAX_SHRINK</code></td><td>Specifies the target size of the tablespace in bytes after shrinking. The default value aims to shrink to the smallest possible size. <sup></sup></td></tr><tr><td><code>shrink_result</code></td><td>CLOB</td><td>OUT</td><td>None (Optional)</td><td>An output parameter that returns the result of the operation as a CLOB (Character Large Object). In analyze mode, it contains movable objects, potential savings; in shrink modes, it includes the number of moved objects, old/new size, etc. <sup></sup></td></tr></tbody></table></figure>



<p><strong>Note on the <code>ITERATIONS</code> Parameter:</strong> Although some community resources and blog posts mention an <code>ITERATIONS</code> parameter for the <code>SHRINK_TABLESPACE</code> procedure, Oracle&#8217;s official <code>DBMS_SPACE</code> package reference documentation does not list such a parameter for this procedure. The concept of <code>ITERATIONS</code> is more commonly associated with the Automatic SecureFiles Shrink feature. It should therefore be assumed that an <code>ITERATIONS</code> parameter is not a standard feature of <code>DBMS_SPACE.SHRINK_TABLESPACE</code>, and that this information is likely a misunderstanding or specific to a custom environment. Users are always advised to refer to the official Oracle documentation.</p>



<p><strong>Usage Scenarios and SQL Examples:</strong> Below are SQL examples demonstrating how the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure can be used in different scenarios. These examples assume <code>DBMS_OUTPUT</code> is used to view the results; the <code>shrink_result</code> CLOB parameter can be used to retrieve similar information programmatically.</p>



<ul class="wp-block-list">
<li><strong>Analyze Mode Usage (To See Potential Savings):</strong>
<pre class="wp-block-code"><code>SET SERVEROUTPUT ON;
DECLARE
  v_result CLOB;
BEGIN
  DBMS_SPACE.SHRINK_TABLESPACE(
    ts_name       => 'USERS',
    shrink_mode   => DBMS_SPACE.TS_SHRINK_MODE_ANALYZE,
    shrink_result => v_result
  );
  DBMS_OUTPUT.PUT_LINE('---ANALYZE RESULT---');
  DBMS_OUTPUT.PUT_LINE(v_result);
END;
/</code></pre>
This command analyzes the &#8216;USERS&#8217; tablespace and reports how much space can be saved and which objects can be moved, but does not perform any shrinking.</li>



<li><strong>Default Shrink Mode (Online Move, Maximum Possible Shrink):</strong>
<pre class="wp-block-code"><code>SET SERVEROUTPUT ON;
DECLARE
  v_result CLOB;
BEGIN
  DBMS_SPACE.SHRINK_TABLESPACE(
    ts_name       => 'USERS',
    shrink_result => v_result  -- shrink_mode and target_size take default values
  );
  DBMS_OUTPUT.PUT_LINE('---SHRINK RESULT---');
  DBMS_OUTPUT.PUT_LINE(v_result);
END;
/</code></pre>
This command shrinks the &#8216;USERS&#8217; tablespace in the default <code>TS_SHRINK_MODE_ONLINE</code> mode and with the <code>TS_TARGET_MAX_SHRINK</code> target size.</li>



<li><strong>Shrinking to a Specific Target Size (Online Mode):</strong>
<pre class="wp-block-code"><code>SET SERVEROUTPUT ON;
DECLARE
  v_result CLOB;
BEGIN
  DBMS_SPACE.SHRINK_TABLESPACE(
    ts_name       => 'MYDATA',
    shrink_mode   => DBMS_SPACE.TS_SHRINK_MODE_ONLINE,
    target_size   => 100 * 1024 * 1024,  -- 100 MB target size
    shrink_result => v_result
  );
  DBMS_OUTPUT.PUT_LINE('---SHRINK RESULT---');
  DBMS_OUTPUT.PUT_LINE(v_result);
END;
/</code></pre>
This example aims to shrink the &#8216;MYDATA&#8217; tablespace to approximately 100 MB in online mode.</li>



<li><strong>AUTO Mode Usage (Try Online First, then Offline if Necessary):</strong>
<pre class="wp-block-code"><code>SET SERVEROUTPUT ON;
DECLARE
  v_result CLOB;
BEGIN
  DBMS_SPACE.SHRINK_TABLESPACE(
    ts_name       => 'USERS',
    shrink_mode   => DBMS_SPACE.TS_SHRINK_MODE_AUTO,
    shrink_result => v_result
  );
  DBMS_OUTPUT.PUT_LINE('---SHRINK RESULT (AUTO MODE)---');
  DBMS_OUTPUT.PUT_LINE(v_result);
END;
/</code></pre></li>



<li><strong>OFFLINE Mode Usage (For Best Shrink, if Downtime is Acceptable):</strong>
<pre class="wp-block-code"><code>SET SERVEROUTPUT ON;
DECLARE
  v_result CLOB;
BEGIN
  DBMS_SPACE.SHRINK_TABLESPACE(
    ts_name       => 'USERS',
    shrink_mode   => DBMS_SPACE.TS_SHRINK_MODE_OFFLINE,
    shrink_result => v_result
  );
  DBMS_OUTPUT.PUT_LINE('---SHRINK RESULT (OFFLINE MODE)---');
  DBMS_OUTPUT.PUT_LINE(v_result);
END;
/</code></pre></li>
</ul>



<p>The presence of the <code>shrink_result</code> CLOB parameter allows for more detailed and structured retrieval of the operation&#8217;s results (often in JSON or XML format, though the documentation doesn&#8217;t specify the format), enabling these results to be programmatically processed by automation scripts or custom reporting tools. For example, within a PL/SQL block, this CLOB content can be parsed to extract metrics such as the number of objects moved, the amount of space reclaimed, and processing time. This information can then be saved to custom log tables, emailed to relevant personnel, or integrated into a monitoring system.</p>
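


<p>A minimal sketch of this idea follows; the <code>shrink_result_log</code> table is a hypothetical name introduced here for illustration, not a dictionary object.</p>



<pre class="wp-block-code"><code>-- Hypothetical log table for storing analyze/shrink results.
CREATE TABLE shrink_result_log (
  run_date DATE DEFAULT SYSDATE,
  ts_name  VARCHAR2(128),
  result   CLOB
);

DECLARE
  v_result CLOB;
BEGIN
  DBMS_SPACE.SHRINK_TABLESPACE(
    ts_name       => 'USERS',
    shrink_mode   => DBMS_SPACE.TS_SHRINK_MODE_ANALYZE,
    shrink_result => v_result
  );
  -- Persist the raw CLOB so it can be parsed or reported on later.
  INSERT INTO shrink_result_log (ts_name, result) VALUES ('USERS', v_result);
  COMMIT;
END;
/</code></pre>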



<p>The introduction of the <code>TS_SHRINK_MODE_AUTO</code> mode demonstrates Oracle&#8217;s effort to provide an intelligent default behavior: &#8220;first try with minimal downtime and in the safest way; if the target cannot be reached with this method or some objects cannot be moved, then automatically switch to a more effective but potentially downtime-inducing method.&#8221; This somewhat reduces the burden on the database administrator to decide &#8220;online or offline?&#8221; for each object or situation. In this mode, Oracle first attempts an online move; if the online move fails for an object (e.g., the object type does not support online move, or the operation cannot be performed due to a lock), it then attempts an offline move for the same object. This is a &#8220;best-effort&#8221; approach aimed at balancing both availability and space reclamation.</p>



<h2 class="wp-block-heading">7. Oracle Enterprise Manager (EM) Integration for Tablespace Shrink</h2>



<p>Oracle Enterprise Manager (EM) is a comprehensive toolset for centrally managing, monitoring, and maintaining Oracle databases and other Oracle products. In the context of the Tablespace Shrink feature, EM can assist DBAs, particularly with analysis, visualization, and indirect monitoring capabilities.</p>



<p><strong>Tablespace Occupancy and Content Analysis (Extent Map):</strong> EM Cloud Control offers powerful tools to visualize the current occupancy rates of tablespaces, the distribution of the segments they contain, and, most importantly, where and how much free space (fragmentation) exists. Specifically, the &#8220;Extent Map&#8221; view graphically displays the status (in use, free) of each extent (space allocation unit) within a tablespace. Free space is typically colored green, and this map is very useful for understanding whether a shrink operation is needed and where potential savings are concentrated. DBAs can usually navigate to the relevant tablespace via the <code>Administration</code> -&gt; <code>Storage</code> -&gt; <code>Tablespaces</code> menu path in the EM interface, then select the tablespace and choose an option like &#8220;Show Tablespace Content&#8221; to access this detailed extent map. This visual analysis is an important step for assessing the situation before running the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure.</p>



<p><strong>Space Gain Analysis with Segment Advisor:</strong> Segment Advisor, an integral part of EM, is an automated tool that analyzes how much space can be gained by reorganizing or compressing tablespaces and the individual segments (tables, indexes, etc.) within them. Segment Advisor evaluates the fragmentation level, number of empty blocks, and other metrics for specific segments, providing concrete recommendations on which segments would benefit most from a shrink operation. These recommendations, similar to the information provided by the <code>TS_SHRINK_MODE_ANALYZE</code> mode of the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure, can be used to assess space reclamation potential and plan the shrink strategy.</p>



<p><strong>Monitoring and Management of Shrink Operations via EM:</strong> Based on the available research materials and Oracle documentation, there is no clear evidence that Enterprise Manager can <strong>initiate, schedule, or actively manage</strong> the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure directly through a graphical user interface (GUI) wizard, a dedicated button, or a menu option. Although EM typically integrates new PL/SQL-based management features into its GUIs over time, such direct integration does not appear to exist for this specific feature, at least according to the cited sources. Other resources might contain more details, but this conclusion is based on the currently available information.</p>



<p>However, this does not mean EM is entirely irrelevant to the shrink process. EM&#8217;s general database monitoring capabilities (e.g., performance monitoring pages, the active sessions list, long-running SQL operations reports) can be used to indirectly monitor the progress and system impact (CPU usage, I/O activity, waits, etc.) of a <code>DBMS_SPACE.SHRINK_TABLESPACE</code> operation initiated via SQL*Plus, SQL Developer, or an automation script. For example, records related to the shrink operation in the <code>V$SESSION_LONGOPS</code> view (if Oracle flags the operation as long-running) may be reflected in EM&#8217;s performance monitoring interfaces.</p>



<p>The absence of a direct &#8220;run&#8221; button or a dedicated management screen in EM for <code>DBMS_SPACE.SHRINK_TABLESPACE</code> might suggest that Oracle prefers such powerful operations—which can potentially be lengthy, resource-intensive, and even cause downtime—to be performed under the full and conscious control of the DBA, typically via scripts and with a full understanding of all parameters. GUIs can sometimes struggle to offer the user the full flexibility and control level provided by PL/SQL procedures. Therefore, Oracle might expect DBAs to execute these critical operations via SQL*Plus, SQL Developer, or automation scripts, carefully specifying all parameters and modes. In this scenario, EM primarily takes on the role of performing detailed analysis before these operations (via Segment Advisor, Extent Map) and monitoring the overall system status and performance during/after the operation.</p>



<p>Nevertheless, given Oracle&#8217;s trend of integrating new database features into Enterprise Manager GUIs over time, it is quite plausible that future EM versions or add-ons for database management packs will offer a more integrated management interface for the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> feature.</p>



<h2 class="wp-block-heading">8. Key Considerations and Limitations for Tablespace Shrink</h2>



<p>While the Tablespace Shrink feature in Oracle 23ai offers significant advantages in storage management, understanding certain limitations and crucial considerations is necessary for its successful and trouble-free implementation.</p>



<p><strong>Unsupported Object Types:</strong> The <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure cannot move every object type within a tablespace. When the procedure is run in <code>TS_SHRINK_MODE_ANALYZE</code> mode, the <code>shrink_result</code> CLOB output includes a list of objects that cannot be moved or are unsupported. For example, cluster tables and some Advanced Queuing (AQ) tables may not be supported by this procedure. Additionally, tables containing columns of the <code>LONG</code> data type, cluster tables, and tables with what Oracle terms &#8220;reservable columns&#8221; cannot be moved even in <code>TS_SHRINK_MODE_OFFLINE</code> (offline move) mode. In tablespaces containing such objects, the shrink operation may not complete fully as expected, or these objects may remain in their original locations.</p>



<p><strong>Impact on LOB Segments, Index-Organized Tables (IOTs), and Partitioned Tables:</strong> The behavior of these commonly used special object types during a shrink operation and the specific conditions they require are as follows:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Object Type</th><th>Behavior/Support with <code>TS_SHRINK_MODE_ONLINE</code></th><th>Behavior/Support with <code>TS_SHRINK_MODE_OFFLINE</code></th><th>Special Notes/Considerations</th></tr><tr><td><strong>LOB (Large Object) Segments</strong></td><td>Movable. <sup></sup></td><td>Movable.</td><td>Row movement (<code>ROW MOVEMENT</code>) must be enabled on the relevant tables for effective shrinking. <sup></sup> Separate automatic shrink mechanisms also exist for SecureFiles LOBs (<code>DBMS_SPACE.SECUREFILE_SHRINK_ENABLED()</code>). <sup></sup></td></tr><tr><td><strong>Index-Organized Tables (IOT)</strong></td><td><strong>Moved offline.</strong> <sup></sup></td><td>Moved offline.</td><td>Even if <code>TS_SHRINK_MODE_ONLINE</code> is selected, IOTs are moved offline, implying downtime for IOT access. Secondary index and mapping table segments cannot be shrunk individually; shrinking the primary segment affects them as well. <sup></sup></td></tr><tr><td><strong>Partitioned Tables</strong></td><td>Tablespace-level shrink affects all partitions within. <sup></sup></td><td>Tablespace-level shrink affects all partitions within.</td><td>The <code>DBMS_SPACE</code> package generally supports partitioned structures. The <code>SHRINK_TABLESPACE</code> procedure takes the tablespace name; it&#8217;s unclear if individual partitions can be directly shrunk with this procedure, likely requiring methods like <code>ALTER TABLE... SHRINK SPACE PARTITION</code>. <sup></sup></td></tr><tr><td><strong>Cluster Tables</strong></td><td>May not be supported / Unmovable. <sup></sup></td><td>Unmovable. <sup></sup></td><td>The <code>ANALYZE</code> mode result should be checked.</td></tr><tr><td><strong>Advanced Queuing (AQ) Tables</strong></td><td>Some AQ tables may not be supported / Unmovable. <sup></sup></td><td>Some AQ tables may not be supported / Unmovable.</td><td>The <code>ANALYZE</code> mode result should be checked.</td></tr><tr><td><strong>Tables with <code>LONG</code> Data Type</strong></td><td>Unmovable. <sup></sup></td><td>Unmovable. <sup></sup></td><td></td></tr></tbody></table></figure>



<p><strong>Differences and Impacts of Online vs. Offline Move Modes:</strong></p>



<ul class="wp-block-list">
<li><strong>Online Modes (<code>TS_SHRINK_MODE_ONLINE</code>, online part of <code>TS_SHRINK_MODE_AUTO</code>):</strong> These modes aim to maximize application availability; DML operations and queries can continue while objects are being moved. However, this flexibility comes at a cost: background operations like data copying and index updating can create additional CPU, I/O, and redo log generation load on the system. Index-Organized Tables (IOTs) are moved offline even if this mode is selected. An important point is that online moves performed by <code>SHRINK_TABLESPACE</code> may not have all the restrictions of the traditional <code>ALTER TABLE... MOVE ONLINE</code> command; the <code>ANALYZE</code> mode will identify objects unsuitable for online move.  </li>



<li><strong>Offline Modes (<code>TS_SHRINK_MODE_OFFLINE</code>, offline part of <code>TS_SHRINK_MODE_AUTO</code>):</strong> These modes are generally faster and can provide better compaction (thus more space reclamation). This is because they avoid the extra work online modes do to ensure DML compatibility. However, the main disadvantage is that DML and query access to the moved objects is blocked during the operation, meaning downtime. When using <code>TS_SHRINK_MODE_AUTO</code>, it should be noted that if an online move attempt fails, the procedure will automatically switch to an offline move, which could lead to unexpected downtime.  </li>
</ul>



<p><strong>Characteristics of Shrink Operations on Smallfile Tablespaces:</strong> Shrinking Smallfile tablespaces can behave somewhat differently from Bigfile tablespaces. Observations suggest that Smallfile shrink operations sometimes do not reclaim as much free space as expected compared to Bigfile operations. This can occur especially when the goal is to consolidate multiple data files into a single file, or when the data distribution is very complex. A single run of <code>DBMS_SPACE.SHRINK_TABLESPACE</code> might not consolidate all data files as desired; the operation may need to be repeated several times to achieve the desired level of consolidation. It should also be noted that during a shrink operation on Smallfile tablespaces, the sizes of some data files may temporarily increase and then decrease, depending on the new layout of the objects within them.</p>



<p><strong>Potential Error Conditions and Partial Failure Scenarios:</strong> A Tablespace Shrink operation may not always complete with 100% success. The operation can partially fail; for example, if an object cannot be moved (because it is locked by another session, is an unsupported object type, etc.), the procedure might report an error for that object. Thanks to the other successfully moved objects, however, the data file might still be shrunk to some extent. A common error, <code>ORA-00054: resource busy and acquire with NOWAIT specified or timeout expired</code>, can occur if an object is locked by another session and the shrink operation cannot wait for the lock. If the operation is interrupted (e.g., manually stopped by the DBA), online DDL operations completed up to that point are not rolled back; the successfully moved objects remain in their new locations, and a subsequent shrink attempt benefits from this, meaning previous gains are preserved even though the operation does not resume from where it left off.</p>
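


<p>One common mitigation for the <code>ORA-00054</code> case is the session-level <code>DDL_LOCK_TIMEOUT</code> parameter, which lets DDL wait for a busy object instead of failing immediately; the 300-second value below is illustrative.</p>



<pre class="wp-block-code"><code>-- Allow DDL issued in this session to wait up to 5 minutes for locked objects
-- instead of raising ORA-00054 immediately.
ALTER SESSION SET ddl_lock_timeout = 300;</code></pre>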



<p><strong>AUTOEXTEND Settings:</strong> If the data files of the tablespace to be shrunk are set to <code>AUTOEXTEND OFF</code> (automatic growth disabled), there may not be enough free space left in the tablespace for segments to grow after the shrink operation. In this case, the DBA may need to manually increase the tablespace or data file size to accommodate future data growth.</p>
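


<p>The relevant settings can be confirmed up front with a query like the one below (the tablespace name is illustrative).</p>



<pre class="wp-block-code"><code>-- Check AUTOEXTEND settings and current/maximum sizes of the data files.
SELECT file_name,
       autoextensible,
       ROUND(bytes    / 1024 / 1024) AS size_mb,
       ROUND(maxbytes / 1024 / 1024) AS max_mb
FROM   dba_data_files
WHERE  tablespace_name = 'USERS';</code></pre>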



<p><strong>Shrinking the SYSAUX Tablespace:</strong> The ability to shrink the SYSAUX tablespace, a system tablespace that houses important metadata like AWR (Automatic Workload Repository) data, statistics history, and scheduled job information, with the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure is a significant capability. This is particularly valuable in environments with limited storage space, such as Oracle Database Free, because SYSAUX counts towards the 12 GB total user data limit in that edition and can grow over time, reducing the space available for application data.</p>
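


<p>Since SYSAUX is a system tablespace, a cautious first step is an analyze-only run, as sketched below.</p>



<pre class="wp-block-code"><code>SET SERVEROUTPUT ON;
DECLARE
  v_result CLOB;
BEGIN
  -- Report potential savings in SYSAUX without moving anything.
  DBMS_SPACE.SHRINK_TABLESPACE(
    ts_name       => 'SYSAUX',
    shrink_mode   => DBMS_SPACE.TS_SHRINK_MODE_ANALYZE,
    shrink_result => v_result
  );
  DBMS_OUTPUT.PUT_LINE(v_result);
END;
/</code></pre>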



<p>The statement &#8220;Online moves via SHRINK_SPACE don&#8217;t have all of the restrictions associated with a conventional ALTER TABLE&#8230; MOVE&#8221; implies that although the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure uses the <code>ALTER TABLE... MOVE ONLINE</code> command in the background, it may be able to bypass some of the known restrictions of the standard command or operate with a different internal logic. This suggests that Oracle may have developed special optimizations or lower-level internal mechanisms for the new procedure. The fact that the <code>ANALYZE</code> mode provides a list of &#8220;unsupported objects&#8221; confirms that some objects still cannot be moved even by this special mechanism. Oracle is pushing the boundaries of online operations, but some fundamental limits still apply.</p>



<h2 class="wp-block-heading">9. Performance Impacts and Evaluations of Tablespace Shrink</h2>



<p>The performance of the Tablespace Shrink operation itself during its execution, as well as its potential effects on the overall operational performance of the database, are important evaluation criteria for DBAs.</p>



<p><strong>Performance of the Shrink Operation Itself (Duration, Resource Usage):</strong> The tablespace shrink operation can be time-consuming, especially for very large tablespaces or when numerous or very large segments need to be moved. The completion time can vary greatly depending on the total amount of data to be moved, the number of segments, the selected <code>shrink_mode</code> (online modes are generally slower than offline modes), the overall load on the system, and the hardware capacity (especially I/O performance).</p>



<p>The progress of the operation can be monitored using Oracle&#8217;s <code>V$SESSION_LONGOPS</code> dynamic performance view, which is designed for tracking long-running operations. By searching this view for an <code>OPNAME</code> like &#8216;%Table%&#8217; or a similar identifier related to the shrink operation (e.g., <code>Table/Index Maintenance</code>), one can see the current phase of the operation, how much of it has completed, and the estimated time remaining.</p>
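


<p>A simple monitoring query along these lines is sketched below; the exact <code>OPNAME</code> values reported may vary by release, so the filter shown is illustrative.</p>



<pre class="wp-block-code"><code>-- Track long-running move/shrink work; rows drop out once SOFAR = TOTALWORK.
SELECT sid,
       serial#,
       opname,
       sofar,
       totalwork,
       ROUND(sofar / totalwork * 100, 1) AS pct_done,
       time_remaining
FROM   v$session_longops
WHERE  totalwork > 0
AND    sofar != totalwork;</code></pre>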



<p>As a general rule, the <code>TS_SHRINK_MODE_OFFLINE</code> (offline move) mode usually completes faster and potentially provides better compaction (more space reclamation) than the <code>TS_SHRINK_MODE_ONLINE</code> (online move) mode. The primary reason is that online modes use additional synchronization, lock management, and versioning mechanisms to let DML (Data Manipulation Language) operations continue during the process, which adds overhead and complexity. The offline mode, by locking the object exclusively, avoids these extra tasks and performs the move more directly.</p>



<p><strong>Potential Impacts on Overall Database Performance (CPU, I/O, Redo):</strong></p>



<ul class="wp-block-list">
<li><strong>Impact During Online Modes:</strong> When using <code>TS_SHRINK_MODE_ONLINE</code> or <code>TS_SHRINK_MODE_AUTO</code> (during its online phase), object movement operations can consume significant CPU and I/O resources in the background. Additionally, redo log generation will increase for every data block moved. This additional load, especially on already busy systems, can negatively affect the performance of other user sessions and application processes. The increase in redo log generation can increase log switch frequency, strain archiving processes, and potentially cause apply lag in physical standby database environments like Data Guard. Therefore, even when using online modes, it is highly recommended to perform such operations during periods of lower system load or within planned maintenance windows.<br></li>



<li><strong>Impact During Offline Modes:</strong> When using <code>TS_SHRINK_MODE_OFFLINE</code> or <code>TS_SHRINK_MODE_AUTO</code> (during its offline phase), there will be no direct DML-related performance issues for the moved objects, as access to them is completely blocked during the operation. However, the operation itself will still consume CPU and I/O resources. The main impact of this mode is that the affected objects become unusable until the operation is complete, which translates to application downtime.<br></li>



<li><strong>Post-Operation Performance:</strong> After a successful Tablespace Shrink operation, positive effects on database performance are generally expected. Reduced data fragmentation and the tighter, more organized packing of data on disk can enable query operations (especially full table scans and wide-ranging index range scans) to run with less I/O and therefore faster. This can lead to improvements in overall application response times.  </li>
</ul>



<p>A review of the available research materials and Oracle documentation turns up no comprehensive benchmark results or official performance studies quantifying the impact of <code>DBMS_SPACE.SHRINK_TABLESPACE</code> in its different modes on metrics like CPU usage, I/O rates, and redo log generation. Such detailed analyses are typically found in Oracle&#8217;s internal tests, beta programs, or dedicated white papers. The most accurate approach is to test the operation in your own environment, on a test system that mimics production, to observe its specific performance impact.</p>



<p>The statement that &#8220;the best shrink result and performance is achieved with <code>TS_SHRINK_MODE_OFFLINE</code>&#8221; should be understood to mean that &#8220;performance&#8221; refers to the <em>completion speed</em> of the shrink operation and its <em>space reclamation effectiveness</em>, not the overall operational performance of the database. The offline mode, by not dealing with DML synchronization or the complexities of online DDL, generally takes less time to move and compact the same amount of data and can consolidate free space more effectively than the online modes.</p>



<h2 class="wp-block-heading">10. Best Practices and Advanced Scenarios for Tablespace Shrink</h2>



<p>To make the most of the Oracle 23ai Tablespace Shrink feature and minimize potential issues, it&#8217;s important to follow certain best practices and plan the operation carefully.</p>



<p><strong>Pre- and Post-Shrink Checks:</strong></p>



<ul class="wp-block-list">
<li><strong>Pre-Operation Steps:</strong>
<ol class="wp-block-list">
<li><strong>Analysis:</strong> Run the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure in <code>TS_SHRINK_MODE_ANALYZE</code> mode to thoroughly examine the potential space savings, movable objects, non-movable (unsupported) objects, and the Oracle-recommended target size for the tablespace to be shrunk. This analysis will give a preliminary idea of whether the operation will be beneficial and how long it might take.  </li>



<li><strong>Visual Inspection (EM):</strong> Use Oracle Enterprise Manager (EM) to visually confirm the current occupancy rate of the tablespace, the distribution of segments, and especially the &#8220;Extent Map&#8221; view to see where fragmentation is concentrated and how free space is distributed.  </li>



<li><strong>Segment Advisor:</strong> Utilize the Segment Advisor tool within EM to perform a more detailed space gain analysis on individual segments within the tablespace and evaluate recommendations on which segments would benefit most from the shrink operation.  </li>



<li><strong>Backup:</strong> As with any major database operation, ensure that a current and verified backup of the tablespace to be processed, and ideally the entire database, has been taken. This provides a fallback option in case of unexpected issues (a general DBA best practice).</li>



<li><strong>AUTOEXTEND Settings:</strong> Check the <code>AUTOEXTEND</code> (automatic growth) settings of the data files belonging to the tablespace to be shrunk. If <code>AUTOEXTEND OFF</code>, there might not be enough free space left for segments to grow after the shrink; plan for manual size increases post-operation if necessary.  </li>



<li><strong>Resource Planning:</strong> Evaluate the potential impact of the operation on system resources (CPU, I/O) and plan to run it during a time of lower system load or within a planned maintenance window.</li>
</ol>
</li>



<li><strong>Post-Operation Steps:</strong>
<ol class="wp-block-list">
<li><strong>Verification (Dictionary Views):</strong> Query Oracle data dictionary views like <code>DBA_FREE_SPACE</code>, <code>DBA_DATA_FILES</code>, and <code>DBA_TABLESPACES</code> to verify the changes in tablespace and data file sizes, the reclaimed free space, and the new high-water mark (HWM) levels (a query sketch follows this list).</li>



<li><strong>Verification (EM):</strong> Re-examine the &#8220;Extent Map&#8221; view of the tablespace via EM to visually confirm that free space has been consolidated as expected and the data file has shrunk.</li>



<li><strong>Update Statistics:</strong> Since the shrink operation changes the physical layout of tables and indexes, it can cause statistics on these objects to become stale. For the Oracle Optimizer to generate accurate query execution plans, statistics for the relevant objects (especially moved tables and rebuilt or affected indexes) must be refreshed with the <code>DBMS_STATS</code> package after the operation; an example call is included in the sketch after this list. This step is critical for fully realizing the expected performance improvements.</li>



<li><strong>Monitor Application Performance:</strong> Closely monitor the performance of relevant applications for a period after the shrink operation to observe any unexpected behavior or performance changes.</li>



<li><strong>Check Alert Log and Trace Files:</strong> Check the Oracle alert log and relevant trace files for any error or warning messages during or after the operation.</li>
</ol>
</li>
</ul>
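


<p>The verification and statistics items above can be covered with queries and a refresh call along the lines of the sketch below; the schema and table names in the <code>DBMS_STATS</code> call are illustrative.</p>



<pre class="wp-block-code"><code>-- 1. Verify reclaimed space and new data file sizes.
SELECT tablespace_name, ROUND(SUM(bytes) / 1024 / 1024) AS free_mb
FROM   dba_free_space
GROUP  BY tablespace_name;

SELECT file_name, ROUND(bytes / 1024 / 1024) AS size_mb
FROM   dba_data_files
WHERE  tablespace_name = 'USERS';

-- 2. Refresh optimizer statistics on a moved table and its indexes.
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'APP_OWNER',
    tabname => 'ORDERS',
    cascade => TRUE  -- also gather statistics for dependent indexes
  );
END;
/</code></pre>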



<p><strong>Mode Selection Based on Different Scenarios:</strong> The following table summarizes the key features of different <code>shrink_mode</code> options and in which scenarios they might be preferred:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Mode Name</th><th>Basic Function</th><th>Downtime Status</th><th>Performance (Operation Speed)</th><th>Space Gain Potential</th><th>Advantages</th><th>Disadvantages</th><th>Recommended Use Cases</th></tr><tr><td><code>TS_SHRINK_MODE_ANALYZE</code></td><td>Analyzes, reports, does not shrink.</td><td>None</td><td>N/A</td><td>N/A</td><td>See potential gain and problematic objects without risk.</td><td>Does not shrink.</td><td>For planning and evaluation before every shrink operation.</td></tr><tr><td><code>TS_SHRINK_MODE_ONLINE</code></td><td>Moves objects online (except IOTs).</td><td>Minimal (downtime for IOTs).</td><td>Medium</td><td>Good</td><td>High availability, DML continues.</td><td>Can be slower than offline, consumes more resources.</td><td>24/7 systems where application downtime is unacceptable.</td></tr><tr><td><code>TS_SHRINK_MODE_AUTO</code></td><td>Tries online first, then falls back to offline if that fails.</td><td>Potential (if switches to offline).</td><td>Variable</td><td>Variable (depends on online/offline success)</td><td>Flexibility, automatic decision-making.</td><td>Can lead to unexpected downtime, exact mode used is not fully predictable.</td><td>Situations where the state is uncertain, or both availability and good shrinkage are desired, but controlled downtime is acceptable.</td></tr><tr><td><code>TS_SHRINK_MODE_OFFLINE</code></td><td>Moves objects offline.</td><td>Yes (for moved objects).</td><td>High</td><td>Best</td><td>Fastest operation, best compaction.</td><td>Access to relevant objects blocked during operation.</td><td>Planned maintenance windows, situations where downtime is acceptable and maximum space gain is targeted.</td></tr></tbody></table></figure>






<p><strong>Maintenance Windows and Scheduling Recommendations:</strong> The Tablespace Shrink operation, especially when run on large tablespaces or when many objects need to be moved, can consume significant system resources and take a long time. It is therefore highly recommended that the operation (even in online modes like <code>TS_SHRINK_MODE_ONLINE</code>) be performed during periods of lowest overall system load or within planned maintenance windows. This minimizes the potential performance impact on other critical application processes and allows sufficient time to intervene and resolve any unexpected issues (e.g., the operation taking much longer than expected or stopping due to an error).</p>



<p><strong>Advanced Scenarios (inferred from the feature&#8217;s design):</strong></p>



<ul class="wp-block-list">
<li><strong>Phased Shrinking of Very Large Tablespaces:</strong> Instead of shrinking a very large tablespace (terabytes in size) in a single operation, dividing the process into several phases using the <code>target_size</code> parameter can be a more controlled approach. For example, a shrink of 10-20% of the tablespace could be targeted in each step. This allows for closer monitoring of the system impact of each phase, makes the operation duration more manageable, and reduces potential risks.</li>



<li><strong>Scripting and Automation:</strong> The <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure can be easily automated via PL/SQL scripts. In particular, the <code>shrink_result</code> CLOB output can be programmatically processed to log information such as the success of the operation, the reclaimed space, and the number of objects moved, or to report these to DBAs. For instance, an automation framework could periodically analyze all tablespaces with <code>TS_SHRINK_MODE_ANALYZE</code>, automatically shrink tablespaces where potential space savings above a certain threshold (e.g., more than 20%) are detected (perhaps first with <code>TS_SHRINK_MODE_ONLINE</code>), or at least send an alert/notification to the DBA (see the sketch after this list).</li>



<li><strong>Integration with Information Lifecycle Management (ILM):</strong> In environments where Information Lifecycle Management (ILM) strategies are implemented, the Tablespace Shrink feature can play a significant role. For example, tablespaces containing old data that is archived or completely deleted from the system after a certain period can be regularly subjected to a shrink operation after this data is cleared. This allows the valuable disk space occupied by these tablespaces to be reclaimed, keeping storage efficiency consistently high.</li>
</ul>
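

<p>The following is a rough sketch of the analyze-and-alert automation idea from the list above, under the same assumptions about the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> interface as the earlier example. Because the <code>shrink_result</code> report format is release-specific, the threshold logic is indicated with a comment rather than parsed here.</p>



<pre class="wp-block-code"><code>SET SERVEROUTPUT ON
DECLARE
  l_report CLOB;
BEGIN
  FOR ts IN (SELECT tablespace_name
             FROM   dba_tablespaces
             WHERE  contents = 'PERMANENT'
             AND    tablespace_name NOT IN ('SYSTEM', 'SYSAUX'))
  LOOP
    -- Analyze each candidate tablespace without moving anything.
    DBMS_SPACE.SHRINK_TABLESPACE(
      ts_name       =&gt; ts.tablespace_name,
      shrink_mode   =&gt; DBMS_SPACE.TS_SHRINK_MODE_ANALYZE,
      shrink_result =&gt; l_report);
    -- A real framework would extract the potential saving from the report
    -- and trigger an online shrink (or a DBA notification) above a chosen
    -- threshold, e.g. 20%.
    DBMS_OUTPUT.PUT_LINE('=== ' || ts.tablespace_name || ' ===');
    DBMS_OUTPUT.PUT_LINE(l_report);
  END LOOP;
END;
/</code></pre>



<p>A block like this could be wrapped in a <code>DBMS_SCHEDULER</code> job so the analysis runs periodically and the reports are forwarded to the DBA team.</p>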



<p>Finally, a point frequently emphasized in Oracle community forums and user experiences is the importance of testing such new features first in non-production (test, development) environments with data similar in size and structure to production data. Tests conducted with small-sized test data may not always accurately reflect the feature&#8217;s behavior, potential performance impacts, or possible bottlenecks in large-scale production systems. Therefore, comprehensive testing is indispensable for a successful production implementation.</p>



<h2 class="wp-block-heading">11. Conclusion and Evaluation of Oracle 23ai Tablespace Shrink</h2>



<p>The Tablespace Shrink feature introduced with Oracle 23ai is a groundbreaking innovation for reclaiming unused space in tablespaces, addressing a significant long-standing challenge for Database Administrators (DBAs). This feature offers the potential to use storage resources more efficiently, thereby reducing costs and potentially enhancing system performance.</p>



<p>The introduction of the feature for Bigfile tablespaces in the initial release of Oracle 23ai, followed by its extension to Smallfile tablespaces with Release Update 23.7, demonstrates Oracle&#8217;s commitment to this functionality and its responsiveness to user needs. The flexible operating modes (<code>TS_SHRINK_MODE_ANALYZE</code>, <code>TS_SHRINK_MODE_ONLINE</code>, <code>TS_SHRINK_MODE_AUTO</code>, <code>TS_SHRINK_MODE_OFFLINE</code>) and control parameters like <code>target_size</code> offered through the <code>DBMS_SPACE.SHRINK_TABLESPACE</code> procedure provide DBAs with the means to perform shrink operations in a manner best suited to their environment&#8217;s requirements and downtime tolerances.</p>



<p>Oracle Enterprise Manager (EM), with tools like &#8220;Extent Map&#8221; and &#8220;Segment Advisor,&#8221; offers valuable contributions for performing detailed analysis before a shrink operation and for visualizing the state of tablespaces. During and after the operation, EM&#8217;s general monitoring capabilities can help track the impact on the system.</p>



<p>However, for the effective and safe use of the Tablespace Shrink feature, awareness of certain limitations and considerations is critically important. The existence of unsupported object types, the different behaviors of special object types like Index-Organized Tables (IOTs) and LOB segments during the shrink process, the impacts of online and offline modes on performance and availability, the unique shrink dynamics of Smallfile tablespaces, and potential error scenarios must be carefully evaluated by DBAs.</p>



<p>In conclusion, the Oracle 23ai Tablespace Shrink feature, when implemented with proper planning, comprehensive testing, and adherence to best practices, is an extremely powerful and valuable tool for DBAs to significantly improve database storage efficiency, reduce unnecessary storage costs, and enhance system performance. Continued developments by Oracle in this area may bring future innovations such as support for more object types, more advanced automation capabilities, or more comprehensive and direct management interfaces via EM. This feature stands out as one of Oracle&#8217;s innovative responses to modern data management challenges.</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/oracle-23ai-tablespace-shrink.html">Oracle 23ai Tablespace Shrink</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Oracle Application Continuity (AC &#038; TAC)</title>
		<link>https://www.bugraparlayan.com.tr/oracle-application-continuity-ac-tac.html</link>
		
		<dc:creator><![CDATA[Bugra Parlayan]]></dc:creator>
		<pubDate>Sat, 03 May 2025 09:54:19 +0000</pubDate>
				<category><![CDATA[Oracle Database]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[Data Replication]]></category>
		<category><![CDATA[Disaster Recovery]]></category>
		<category><![CDATA[Failover]]></category>
		<category><![CDATA[Fault Tolerance]]></category>
		<category><![CDATA[High Availability]]></category>
		<category><![CDATA[Load Balancing]]></category>
		<category><![CDATA[Redundancy]]></category>
		<category><![CDATA[Zero Downtime]]></category>
		<guid isPermaLink="false">https://www.bugraparlayan.com.tr/?p=1464</guid>

					<description><![CDATA[<p>Modern business applications are built on the foundation of uninterrupted service delivery and high availability. Planned or unplanned disruptions at the database layer can negatively impact the end-user experience and jeopardize business continuity. Oracle offers various technologies under the Oracle Maximum Availability Architecture (MAA) umbrella to overcome these challenges. One such critical feature, Oracle Application &#8230;</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/oracle-application-continuity-ac-tac.html">Oracle Application Continuity (AC &amp; TAC)</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[



<p>Modern business applications are built on the foundation of uninterrupted service delivery and high availability. Planned or unplanned disruptions at the database layer can negatively impact the end-user experience and jeopardize business continuity. Oracle offers various technologies under the Oracle Maximum Availability Architecture (MAA) umbrella to overcome these challenges. One such critical feature, Oracle Application Continuity (AC), is designed specifically to ensure high availability at the application layer.</p>



<h3 class="wp-block-heading">1.1 Definitions: Application Continuity (AC) and Transparent Application Continuity (TAC)</h3>



<p><strong>Application Continuity (AC)</strong> is an Oracle Database feature that enables the seamless and rapid replay of an in-flight request against the database following a recoverable error that makes the database session unusable. Its primary goal is to ensure that the interruption appears to the end-user as nothing more than a delay in request processing. AC works by completely reconstructing the database session after an outage, including all states, cursors, variables, and the last transaction (if any). This effectively masks disruptions caused by planned maintenance (e.g., patching, configuration changes) or unplanned outages (e.g., network errors, instance failures).</p>



<p><strong>Transparent Application Continuity (TAC)</strong>, introduced with Oracle Database 18c, is an extension or mode of AC. TAC transparently tracks and records session and transactional state, enabling the recovery of a database session after recoverable outages. The key characteristic of TAC is its ability to operate without requiring any application code changes or specific knowledge of the application by the database administrator (DBA). This transparency is achieved through a state-tracking infrastructure that categorizes session state usage.</p>



<p>Both AC and TAC can be used with Oracle Real Application Clusters (RAC), Oracle RAC One Node, Oracle Active Data Guard, and Oracle Autonomous Database (both shared and dedicated infrastructure). These features enhance the fault tolerance of systems and applications by masking database outages and recovering in-flight work that would otherwise be lost.</p>



<h3 class="wp-block-heading">1.2 Problem Solved: Masking Interruptions and Ensuring Business Continuity</h3>



<p>Without AC/TAC, database outages cause significant problems for applications. Applications receive error messages, users are left uncertain about the status of their transactions (e.g., money transfers, flight reservations, orders), and middleware servers might even need restarting to handle the surge of login requests post-outage. This leads to both end-user dissatisfaction and operational inefficiency.</p>



<p>AC and TAC enable the Oracle Database, Oracle drivers, and Oracle connection pools to collaborate, safely and reliably masking many planned and unplanned outages. By automatically handling recoverable errors, they improve the end-user experience and reduce the need for application developers to write complex error-handling code. This boosts developer productivity and aims for uninterrupted application operation.</p>



<p>The evolution from Oracle&#8217;s basic failover mechanisms (like TAF &#8211; Transparent Application Failover) to AC and then TAC reflects a strategic shift towards making high availability increasingly transparent and reducing application-specific coding dependencies. TAF (pre-12c) had significant limitations, especially around DML operations and session state management. AC (12c) addressed DML replay but required awareness of connection pool usage and request boundaries. TAC (18c+) further reduced complexity by automating state tracking and boundary detection. This progression shows Oracle recognized the adoption barriers of earlier solutions and prioritized ease of use alongside capability. Consequently, TAC has become Oracle&#8217;s preferred solution for modern applications, especially in cloud and Autonomous Database environments, while AC remains relevant for specific legacy systems or customization needs.</p>



<h3 class="wp-block-heading">1.3 Role within Oracle Maximum Availability Architecture (MAA)</h3>



<p>AC and TAC extend Oracle&#8217;s MAA principles to the application tier. MAA is a set of best practices, configurations, and architectural blueprints designed to achieve zero data loss and zero application downtime goals. AC and TAC contribute to these goals by recovering in-flight transactions and the application stack.</p>



<p>These features work in conjunction with other Oracle HA solutions like RAC, Data Guard, and Fast Application Notification (FAN) to form the building blocks for continuous availability. The MAA framework aims to keep applications continuously available by hiding planned and unplanned events, as well as load imbalances at the database tier. AC and TAC are integral parts of this architecture, minimizing the impact of database outages on the application.</p>



<h2 class="wp-block-heading">2. Core Concepts and Working Mechanism</h2>



<p>The fundamental principle behind Application Continuity is to recover ongoing work during an interruption and allow it to continue without the user noticing. This is achieved through a complex replay process, accurate definition of request boundaries, and meticulous management of session state.</p>



<h3 class="wp-block-heading">2.1 The Replay Process: How AC/TAC Recovers Sessions</h3>



<p>The working mechanism of AC and TAC involves the following steps when a recoverable error is detected:</p>



<ol class="wp-block-list">
<li><strong>Error Detection:</strong> The system identifies a recoverable error (e.g., network interruption, temporary instance failure) that renders the session unusable.  </li>



<li><strong>New Session Establishment:</strong> A new database session is established on another available database instance.  </li>



<li><strong>Session State Restoration:</strong> The state of the original session before the interruption (non-transactional state, variables, PL/SQL package states, etc.) is reconstructed in the new session. This is managed through service parameters like <code>FAILOVER_RESTORE</code> and <code>SESSION_STATE_CONSISTENCY</code>, and mechanisms like Database Templates in 23ai.  </li>



<li><strong>Replay of Database Calls:</strong> The database calls (SQL queries, DML operations) made from the beginning of the interrupted request are executed sequentially in the new session.  </li>



<li><strong>Consistency Check and Idempotence:</strong> During replay, data consistency is checked. The Transaction Guard mechanism ensures that the transaction is committed only once (idempotence), especially if the interruption occurred during the <code>COMMIT</code> operation.  </li>



<li><strong>Continuation or Error:</strong> If the replay is successful, the application perceives the interruption merely as a delay and continues from where it left off. However, if data inconsistency is detected during replay (e.g., a replayed query returns different results) or an unrecoverable state is encountered, the replay is rejected, and the application receives the original error. Unrecoverable errors (e.g., invalid data input) are never replayed.  </li>
</ol>



<p>This process ensures that the user is unaffected by the interruption and the transaction is either completed safely or the original error state is accurately reported.</p>



<h3 class="wp-block-heading">2.2 Understanding Request Boundaries</h3>



<p>A &#8220;request&#8221; is a logical unit of work from the application&#8217;s perspective. Defining the start and end points of these work units, known as request boundaries, is critical for the correct functioning of AC and TAC. These boundaries define the scope of work to be replayed and allow the system to discard unnecessary call history, using resources efficiently.</p>



<ul class="wp-block-list">
<li><strong>Typical Boundary:</strong> Often, a request boundary spans the time between an application borrowing a database connection from a connection pool (checkout) and returning it (check-in). This is the default behavior for ODP.NET and Oracle connection pools.  </li>



<li><strong>Explicit Boundaries:</strong> If Oracle connection pools are not used, or if AC is managed manually, the application must explicitly mark the request boundaries. This is usually done via driver-provided API calls like <code>BeginRequest</code> and <code>EndRequest</code> (or equivalents). This method improves resource consumption and ensures replay occurs within the correct scope.  </li>



<li><strong>Implicit/Discovered Boundaries (TAC):</strong> A significant advantage of TAC is its ability to automatically detect request boundaries. Especially with modern drivers (JDBC 18c+, OCI 19c+), TAC can determine boundaries by monitoring the application&#8217;s behavior. Conditions for discovering a boundary typically include no open transaction, cursors being closed or cached, and the session state being restorable.  </li>



<li><strong>Importance:</strong> Besides defining the replay scope, request boundaries are fundamental for functions like connection draining during planned maintenance, load balancing, and resource management.  </li>
</ul>



<h3 class="wp-block-heading">2.3 Session State Management and Restoration (<code>FAILOVER_RESTORE</code>)</h3>



<p>For a successful replay, the state of the original session before the interruption must be consistent with the state of the new session where the replay occurs. This includes not only the in-flight transaction but also non-transactional session settings, PL/SQL package states, temporary objects, and other session attributes.</p>



<p>Oracle provides various mechanisms and service parameters to manage this state:</p>



<ul class="wp-block-list">
<li><strong><code>FAILOVER_RESTORE</code> Service Parameter:</strong> Determines the extent to which session state is restored after failover (the current settings can be confirmed with the query after this list).
<ul class="wp-block-list">
<li><code>LEVEL1</code> (or <code>BASIC</code>): Available for TAF and AC since Oracle Database 12.2, this setting restores commonly used basic initial session states (e.g., NLS settings).  </li>



<li><code>AUTO</code> (TAC): The recommended setting used with TAC. It enables automatic tracking, validation, and restoration of session state.  </li>
</ul>
</li>



<li><strong><code>SESSION_STATE_CONSISTENCY</code> Service Parameter:</strong> Controls how session state is handled during a request.
<ul class="wp-block-list">
<li><code>DYNAMIC</code> (Older default for AC): If non-transactional session state changes during the request, replay is internally disabled until the next request.  </li>



<li><code>STATIC</code> (Older, limited support for AC): Assumes the application does not change non-transactional state during the request. As of 23ai, its use with <code>FAILOVER_TYPE=TRANSACTION</code> is not supported.  </li>



<li><code>AUTO</code> (TAC): State is tracked and validated transparently. After a disablement, failover is automatically re-enabled when possible. Recommended for TAC.  </li>
</ul>
</li>



<li><strong>Restoring Application-Specific State:</strong> If there are application-specific initial states not covered by <code>FAILOVER_RESTORE=LEVEL1</code> (e.g., custom PL/SQL package variables), additional mechanisms are needed:
<ul class="wp-block-list">
<li><strong>Connection Initialization Callbacks (Java):</strong> The application can register a callback function to be invoked when a connection is obtained or during replay.  </li>



<li><strong>TAF Callbacks (OCI &#8211; Legacy):</strong> A similar mechanism for OCI applications.  </li>



<li><strong>UCP/WLS Connection Labeling:</strong> State management can be achieved by assigning labels to connections and defining callbacks that initialize state based on these labels.  </li>



<li>It is crucial that these callback mechanisms are <strong>idempotent</strong>, meaning they produce the same result if run multiple times, as an outage could occur during the callback itself.  </li>
</ul>
</li>



<li><strong>Oracle 23ai Database Templates:</strong> This new feature introduced in 23ai provides more advanced checkpointing and restoration of session state, enhancing the scope and reliability of AC.  </li>
</ul>
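

<p>The effective settings of a service can be confirmed from the data dictionary. A quick check is sketched below; <code>app_svc</code> is a placeholder service name, and the exact column set of <code>DBA_SERVICES</code> varies by release.</p>



<pre class="wp-block-code"><code>-- Inspect the failover-related attributes of an application service.
SELECT name,
       failover_type,
       failover_restore,
       commit_outcome,
       session_state_consistency
FROM   dba_services
WHERE  name = 'app_svc';</code></pre>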



<p>The complexity of session state management has been a significant barrier to AC/TAC adoption. The evolution from manual callbacks to <code>FAILOVER_RESTORE=LEVEL1</code>, then <code>SESSION_STATE_CONSISTENCY=AUTO</code>, and finally Database Templates demonstrates Oracle&#8217;s continuous effort to automate and simplify this critical aspect. Inconsistent session state between the original and replayed session is a primary cause of replay failure. TAC&#8217;s <code>AUTO</code> setting and 23ai&#8217;s Templates aim to make state management transparent, significantly increasing the likelihood of successful replay and broadening applicability. However, this implies that applications with complex, non-standard session state might still require careful design or potentially fall back to AC with custom callbacks. Understanding the application&#8217;s state usage is crucial for selecting the right approach and configuration. The addition of the <code>RESET_STATE</code> feature further underscores the need to manage state cleanly between requests.</p>



<h3 class="wp-block-heading">2.4 Transactional State and Idempotence</h3>



<p>AC and TAC aim to preserve the integrity of the last transaction during the replay of an interrupted request. This becomes critical, especially when an interruption occurs after the <code>COMMIT</code> command is sent but before the acknowledgment is received. This is where <strong>Transaction Guard (TG)</strong> comes into play.</p>



<p>TG determines the definitive outcome (<code>COMMIT_OUTCOME</code>) of the transaction, preventing the same transaction from being committed multiple times during replay. AC and TAC rely on this idempotence guarantee provided by TG to perform the replay safely. The detailed mechanism of Transaction Guard is discussed in Section 4.</p>



<h2 class="wp-block-heading">3. Essential Enabling Components</h2>



<p>The seamless operation of Application Continuity necessitates an ecosystem approach, requiring the coordinated function of the database, drivers, connection pools, and notification mechanisms.</p>



<h3 class="wp-block-heading">3.1 Transaction Guard (TG): Guaranteeing Definitive Commit Outcomes</h3>



<p>Transaction Guard is a cornerstone of AC/TAC. It prevents duplicate transactions by ensuring at-most-once execution during replay. AC and TAC depend on TG to determine the transaction&#8217;s status post-outage and ensure safe replay. (Details in Section 4.)</p>



<h3 class="wp-block-heading">3.2 Application Continuity Aware Drivers</h3>



<p>For replay to occur, the client-side Oracle drivers must support Application Continuity and be capable of capturing database operations for potential replay. Key supported drivers include:</p>



<ul class="wp-block-list">
<li><strong>JDBC:</strong> Oracle JDBC Replay Driver. Version 12c or later for AC, 18c or later for TAC. With the Oracle 23ai driver, AC support is automatically enabled.  </li>



<li><strong>OCI (Oracle Call Interface):</strong> OCI Session Pool. Version 12.2 or later for AC, 19c or later for TAC.  </li>



<li><strong>ODP.NET:</strong> Unmanaged Provider (in pooled mode). Version 12.2 or later for AC, 18c or later for TAC. Core and Managed ODP.NET support was added in later releases.  </li>



<li><strong>SQL*Plus:</strong> Version 19c (specifically 19.3) or newer supports AC/TAC.  </li>



<li><strong>Others:</strong> Support for languages like Python, PHP, Node.js is typically provided via the respective OCI or JDBC drivers.  </li>
</ul>



<p>Unsupported drivers or configurations include Asynchronous ODP.NET, older drivers, the JDBC OCI Type 2 driver, OLE DB, ODBC, OCCI, and pre-compilers.</p>



<h3 class="wp-block-heading">3.3 Connection Pools</h3>



<p>Connection pools play a critical role in the effectiveness of AC and TAC. Pools manage the lifecycle of connections, which simplifies the determination of request boundaries. They also integrate with FAN/FCF to respond quickly to outage notifications and manage connections. Major supported pools include:</p>



<ul class="wp-block-list">
<li>Oracle Universal Connection Pool (UCP) (12c+)   </li>



<li>WebLogic Server (WLS) Active GridLink (12c+)   </li>



<li>Third-party JDBC application servers using UCP (e.g., JBoss, HikariCP)   </li>



<li>OCI Session Pool   </li>



<li>ODP.NET Connection Pool   </li>
</ul>



<p>Best practice dictates returning connections to the pool immediately after each request completes. Holding connections unnecessarily hinders draining during planned maintenance and compromises high availability.</p>



<h3 class="wp-block-heading">3.4 Fast Application Notification (FAN) and Fast Connection Failover (FCF)</h3>



<p><strong>Fast Application Notification (FAN)</strong> is an Oracle Clusterware mechanism that publishes event notifications about the status of cluster and database services (e.g., instance crash, service start/stop, load balancing advisories).</p>



<p>FAN&#8217;s critical role is providing <em>immediate</em> notification about outages. This allows clients and connection pools to react swiftly instead of waiting for TCP/IP timeouts. FAN is a <strong>mandatory</strong> component for effective AC/TAC.</p>



<p><strong>Fast Connection Failover (FCF)</strong> is a client-side feature, typically embedded within connection pools (UCP, WLS Active GridLink), that subscribes to FAN events. FCF uses the received FAN events to perform actions like:</p>



<ul class="wp-block-list">
<li>Immediately terminating or removing connections belonging to failed instances from the pool.  </li>



<li>Initiating connection draining for planned maintenance.  </li>



<li>Performing runtime connection load balancing.  </li>
</ul>



<p>FAN events are transported via the <strong>Oracle Notification Service (ONS)</strong>, typically requiring port 6200 to be open.</p>



<p>The tight coupling between AC/TAC, specific drivers, connection pools, and FAN/FCF underscores that Application Continuity is not just a database feature but an <em>ecosystem</em> requiring coordinated configuration across tiers (client, mid-tier, database). Replay requires driver intelligence. Request boundaries are often managed by pools. Fast failure detection relies on FAN/FCF. Without all pieces working together (correct driver versions, pool configuration, network paths for ONS, service settings), AC/TAC will not function effectively, or at all. This necessitates a holistic view and collaboration between DBAs, application developers, and potentially network administrators for AC/TAC implementation. Simply enabling a service feature is insufficient. The checklist approach in MAA documentation reinforces the need for comprehensive configuration. While the introduction of automatic ONS configuration aims to simplify this, understanding the components remains crucial.</p>



<h2 class="wp-block-heading">4. Deep Dive: Transaction Guard (TG)</h2>



<p>Transaction Guard is a fundamental technology offered by Oracle Database that significantly enhances application reliability, especially after interruptions. It is the key mechanism behind the safe and automated replay capability of Application Continuity.</p>



<h3 class="wp-block-heading">4.1 Purpose: Ensuring At-Most-Once Execution</h3>



<p>The core problem TG solves is the uncertainty that arises when the <code>COMMIT</code> acknowledgment sent to the client is lost following a recoverable error (like a network outage). The application cannot know if the transaction actually succeeded. Retrying the transaction in this state of uncertainty could lead to the same transaction being executed multiple times (a duplicate transaction), causing logical data corruption.</p>



<p>TG addresses this by providing <strong>idempotence</strong>. In this context, idempotence is the ability to guarantee that a transaction, if retried after an error, is executed at most once. TG enables the application to learn the definitive outcome of the last transaction before the interruption, thereby preventing logical corruption caused by duplicate transactions.</p>



<h3 class="wp-block-heading">4.2 How It Works: Logical Transaction ID (LTXID)</h3>



<p>At the heart of Transaction Guard is a globally unique identifier called the <strong>Logical Transaction ID (LTXID)</strong>. The working principle involves these steps:</p>



<ol class="wp-block-list">
<li><strong>LTXID Assignment:</strong> When a database session is established, it is automatically assigned an LTXID. This ID typically consists of the session&#8217;s logical number and a commit number that increments with each <code>COMMIT</code> or <code>ROLLBACK</code> within the session.  </li>



<li><strong>LTXID Tracking:</strong> The database tracks the LTXID for each transaction within the session. A copy of the LTXID is held both on the client (in the OCI session handle or JDBC/ODP.NET connection object) and on the server.  </li>



<li><strong>Commit Outcome Association:</strong> When a transaction is committed, its outcome (success/failure) is associated with the corresponding LTXID, and this information is persistently stored in the database.  </li>



<li><strong>Reliable Outcome Retrieval:</strong> When an interruption or error occurs and the <code>COMMIT</code> acknowledgment is lost, the application (or AC) uses the LTXID held on the client for the failed session.  </li>



<li><strong><code>GET_LTXID_OUTCOME</code> Call:</strong> The application invokes the <code>DBMS_APP_CONT.GET_LTXID_OUTCOME</code> PL/SQL procedure, passing this LTXID (see the sketch after this list).</li>



<li><strong>Outcome Return:</strong> Based on the stored LTXID information, the database returns the definitive outcome of the transaction (committed/not committed, completed/not completed) to the application. This information allows the application to safely decide whether to retry the transaction.  </li>



<li><strong>At-Most-Once Enforcement:</strong> When the outcome is requested using the LTXID, the database can block any earlier in-flight transaction with the same LTXID from committing, thus ensuring at-most-once execution.  </li>



<li><strong>Retention Period:</strong> The database retains the LTXID and its associated commit outcome for a configurable duration (default 24 hours, set by the <code>RETENTION_TIMEOUT</code> service parameter), giving applications sufficient time for recovery and outcome querying.  </li>
</ol>
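

<p>A server-side sketch of the outcome query from step 5 is shown below. The LTXID itself must be captured on the client from the failed session (for example, via the JDBC driver&#8217;s <code>getLogicalTransactionId()</code> call), so it is left here as a placeholder assignment.</p>



<pre class="wp-block-code"><code>SET SERVEROUTPUT ON
DECLARE
  l_ltxid     RAW(64);   -- LTXID captured by the client driver from the failed session
  l_committed BOOLEAN;
  l_completed BOOLEAN;
BEGIN
  -- l_ltxid := &lt;value obtained on the client, e.g. via getLogicalTransactionId()&gt;;
  DBMS_APP_CONT.GET_LTXID_OUTCOME(
    client_ltxid        =&gt; l_ltxid,
    committed           =&gt; l_committed,
    user_call_completed =&gt; l_completed);

  IF l_committed THEN
    DBMS_OUTPUT.PUT_LINE('Committed; do not resubmit the transaction.');
  ELSE
    DBMS_OUTPUT.PUT_LINE('Not committed; safe to resubmit.');
  END IF;
END;
/</code></pre>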



<h3 class="wp-block-heading">4.3 Benefits and Use Cases</h3>



<p>The primary benefits provided by Transaction Guard include:</p>



<ul class="wp-block-list">
<li>Provides definitive commit outcomes.  </li>



<li>Prevents logical corruption by avoiding duplicate transactions.  </li>



<li>Enables safe transaction replay.  </li>



<li>Forms the foundation for Application Continuity.  </li>



<li>Improves user experience and reduces support costs.  </li>



<li>Increases developer productivity.  </li>
</ul>



<p>Use cases include critical applications where duplicate transactions are unacceptable (banking, e-commerce order systems, etc.), enabling AC/TAC, and allowing applications to safely implement their own custom recovery logic.</p>



<h3 class="wp-block-heading">4.4 Configuration Requirements</h3>



<p>To enable and use Transaction Guard, the following steps are required:</p>



<ol class="wp-block-list">
<li><strong>Database Version:</strong> Oracle Database 12.1 or newer must be used.  </li>



<li><strong>Application Service:</strong> All database work must go through a specifically created application service. The default database service should not be used. The service is created with <code>srvctl</code> for RAC or <code>DBMS_SERVICE</code> for single instance.  </li>



<li><strong><code>COMMIT_OUTCOME</code> Parameter:</strong> The <code>COMMIT_OUTCOME</code> parameter must be set to <code>TRUE</code> on the application service (see the sketch after this list).</li>



<li><strong>Grant Permission:</strong> <code>EXECUTE</code> privilege on the <code>DBMS_APP_CONT</code> package must be granted to the application users who will call the <code>GET_LTXID_OUTCOME</code> procedure.  </li>



<li><strong><code>DDL_LOCK_TIMEOUT</code> (Optional):</strong> If TG is to be used with DDL statements, increasing the <code>DDL_LOCK_TIMEOUT</code> parameter might be considered.  </li>



<li><strong>Recommendations:</strong> FAN configuration (for RAC/Data Guard), checking <code>RETENTION_TIMEOUT</code>, using connection pools.  </li>
</ol>
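

<p>For a single-instance database, steps 2&#8211;4 can be sketched as follows; <code>app_svc</code> and <code>app_user</code> are placeholders, the parameter-array overload of <code>DBMS_SERVICE.MODIFY_SERVICE</code> is assumed to be available in your release, and on RAC the same attribute is set with <code>srvctl</code> instead.</p>



<pre class="wp-block-code"><code>-- Enable Transaction Guard on an existing application service (sketch).
DECLARE
  params DBMS_SERVICE.SVC_PARAMETER_ARRAY;
BEGIN
  params('COMMIT_OUTCOME') := 'true';
  DBMS_SERVICE.MODIFY_SERVICE('app_svc', params);
END;
/

-- Allow the application user to query commit outcomes.
GRANT EXECUTE ON DBMS_APP_CONT TO app_user;</code></pre>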



<h3 class="wp-block-heading">4.5 Relationship with Application Continuity</h3>



<p>Transaction Guard is a foundational technology underlying Application Continuity. AC/TAC <strong>internally and automatically</strong> uses TG when performing replay after an outage to reliably determine the status of the previous transaction.</p>



<p>Thanks to this integration, developers using AC/TAC generally do not need to interact directly with TG APIs; AC/TAC manages this process behind the scenes. However, if an application needs to implement its own custom recovery or replay logic, TG can also be used independently.</p>



<p>In essence, Transaction Guard lays the groundwork for safe replay by addressing the fundamental problem of uncertainty during commit in distributed systems. AC/TAC builds upon this foundation to offer an automated and transparent application continuity solution. Understanding TG is important not only for AC/TAC but also for building robust Oracle applications that need to handle failures during transactions.</p>



<h2 class="wp-block-heading">5. Application Continuity (AC) vs. Transparent Application Continuity (TAC)</h2>



<p>Oracle offers two primary mechanisms for ensuring application continuity: Application Continuity (AC) and Transparent Application Continuity (TAC). While both serve the same fundamental purpose, they differ significantly in their operation, configuration requirements, and impact on the application. Choosing the right solution depends on the application&#8217;s architecture, the technologies used, and the desired level of transparency.</p>



<h3 class="wp-block-heading">5.1 Key Differences Summarized</h3>



<p>The following table summarizes the core differences between AC and TAC:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Feature</th><th>Application Continuity (AC)</th><th>Transparent Application Continuity (TAC)</th></tr><tr><td><strong>Transparency Level</strong></td><td>Lower (Requires pool/boundary awareness, potential code changes for state/side-effects)</td><td>Higher (Aims for zero code changes, automatic state/boundary management)</td></tr><tr><td><strong>Request Boundaries</strong></td><td>Explicit (App/Pool defined) or Implicit (Pool defined)</td><td>Implicit/Discovered (Driver/Database detects)</td></tr><tr><td><strong>Session State Management</strong></td><td>Requires <code>FAILOVER_RESTORE=LEVEL1</code> + potential Callbacks/Labeling</td><td>Uses <code>FAILOVER_RESTORE=AUTO</code>, <code>SESSION_STATE_CONSISTENCY=AUTO</code></td></tr><tr><td><strong>Side Effect Handling (Default)</strong></td><td>Replays side effects</td><td>Does <strong>not</strong> replay side effects</td></tr><tr><td><strong>Customization (Callbacks, Side Effects)</strong></td><td>Yes (Allows callbacks, explicit side-effect replay)</td><td>No (Designed for transparency, avoids complex customization)</td></tr><tr><td><strong>DB Version Introduced</strong></td><td>12.1 (JDBC), 12.2 (OCI/ODP.NET)</td><td>18c/19c</td></tr><tr><td><strong>Key Service Setting (<code>FAILOVER_TYPE</code>)</strong></td><td><code>TRANSACTION</code></td><td><code>AUTO</code></td></tr></tbody></table></figure>



<p>This table provides a concise, direct comparison of the most critical differentiating factors between AC and TAC, derived from synthesizing information across numerous citations, aiding user clarity and decision-making.</p>



<h3 class="wp-block-heading">5.2 Transparency and Configuration Effort</h3>



<p>TAC&#8217;s primary design goal is to require minimal or zero changes to application code. Features like automatic state tracking and request boundary discovery mean TAC configuration is generally simpler than AC, often achieved by using <code>AUTO</code> values in service parameters.</p>



<p>In contrast, AC might necessitate code adjustments, particularly when using older <code>oracle.sql.*</code> concrete classes, when custom session state management is needed, or when request boundaries must be manually defined because no connection pool is used.</p>



<h3 class="wp-block-heading">5.3 Handling Side Effects (Non-Idempotent Operations)</h3>



<p>Side effects are actions that occur outside the main database transaction and leave persistent results. Examples include sending emails via <code>UTL_SMTP</code>, writing to the file system, calling external web services, or autonomous transactions.</p>



<ul class="wp-block-list">
<li><strong>AC&#8217;s Default Behavior:</strong> AC replays statements with side effects by default. This can lead to undesired outcomes in some cases (e.g., sending the same email twice). AC provides mechanisms to manage this, such as disabling replay for specific code blocks (the <code>disableReplay</code> API) or potentially using callbacks for custom handling.</li>



<li><strong>TAC&#8217;s Default Behavior:</strong> TAC does <strong>not</strong> replay side effects by default. TAC automatically detects calls known to have side effects and prevents their replay. This provides a safer default behavior for most applications.  </li>
</ul>



<p>The ACCHK tool can report on non-replayable side effects.</p>



<h3 class="wp-block-heading">5.4 Customization Capabilities (Callbacks, Initial State)</h3>



<p>AC allows customization for complex initial session state setups via connection initialization callbacks or connection labeling. This is useful when the application needs to be brought to a specific initial state post-failover.</p>



<p>TAC generally does not support such customizations, as its core philosophy is transparency and automation.</p>



<h3 class="wp-block-heading">5.5 When to Choose AC vs. TAC?</h3>



<ul class="wp-block-list">
<li><strong>TAC:</strong> Is the <strong>default and recommended</strong> solution for most modern applications, especially where code changes are undesirable or not feasible. It&#8217;s simpler and leverages automatic features.  </li>



<li><strong>AC:</strong> May be preferred or required in the following situations:
<ul class="wp-block-list">
<li>Using older driver or database versions (pre-18c/19c for TAC).  </li>



<li>Needing fine-grained control over side-effect replay (e.g., intentionally replaying or suppressing specific side effects).  </li>



<li>Requiring complex initial state setup via callbacks.  </li>



<li>The application uses state patterns that cannot be automatically managed by TAC (e.g., persistent temporary tables not cleaned up between requests).  </li>
</ul>
</li>
</ul>



<p>The choice between AC and TAC is not purely technical; it reflects a trade-off between transparency/simplicity (TAC) and control/customization (AC). TAC lowers the barrier to adoption by automating many complex aspects. However, this automation comes at the cost of flexibility. AC provides hooks for developers to handle edge cases or specific requirements that do not fit the TAC model. Architects need a thorough understanding of the application&#8217;s behavior regarding session state and side effects to make the right choice. Migrating an application designed for AC to TAC might require analysis to determine whether the default TAC behavior (e.g., not replaying side effects) is acceptable. The existence of both options caters to a wider range of application architectures and legacy constraints.</p>



<h2 class="wp-block-heading">6. Requirements and Compatibility</h2>



<p>Successfully implementing Application Continuity depends on meeting a set of requirements spanning database and client versions, service configuration, and application design. Incompatible or incomplete configurations can lead to the feature not working as expected, or not working at all.</p>



<h3 class="wp-block-heading">6.1 Database and Client Version Prerequisites</h3>



<p>Specific minimum Oracle database and client versions are required for AC, TAC, and related technologies to function. The table below summarizes these requirements:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Component</th><th>AC Requirement</th><th>TAC Requirement</th><th>TG Requirement</th><th>23ai Features</th></tr><tr><td><strong>Database</strong></td><td>12.1+</td><td>18c+ (19c+ recommended)</td><td>12.1+</td><td>23ai</td></tr><tr><td><strong>JDBC Driver</strong></td><td>12c+ Replay</td><td>18c+ Replay (19c+ rec.)</td><td>12.1+ Thin</td><td>23ai (auto-on)</td></tr><tr><td><strong>OCI Client/Driver</strong></td><td>12.2+</td><td>19c+</td><td>12.1+</td><td>23ai</td></tr><tr><td><strong>ODP.NET Unmanaged (Pooled)</strong></td><td>12.2+</td><td>18c+</td><td>12.1+</td><td>23ai (full)</td></tr><tr><td><strong>SQL*Plus</strong></td><td>19.3+</td><td>19.3+</td><td>&#8211;</td><td>23ai</td></tr></tbody></table></figure>



<p>This table centralizes critical version compatibility information necessary for planning deployments or upgrades, preventing users from attempting unsupported combinations.</p>



<p>Using the <strong>latest client drivers</strong> is always strongly recommended for best results and full feature support. Mismatches between driver and database versions can lead to issues.</p>



<h3 class="wp-block-heading">6.2 Essential Service Configuration Parameters</h3>



<p>Using application-specific services managed by Oracle Clusterware (or created with <code>DBMS_SERVICE</code> for single instance) instead of the default database service is mandatory for AC and TAC. These services must be configured using the <code>srvctl modify service</code> (or <code>add service</code>) command with the following key parameters (a combined example follows the list):</p>



<ul class="wp-block-list">
<li><code>-failovertype</code>: Determines replay behavior. Use <code>TRANSACTION</code> for AC, <code>AUTO</code> for TAC.  </li>



<li><code>-commit_outcome</code>: Enables Transaction Guard. Must be set to <code>TRUE</code> for AC and TAC.  </li>



<li><code>-failover_restore</code>: Sets the session state restoration level. Typically <code>LEVEL1</code> for AC, <code>AUTO</code> for TAC.  </li>



<li><code>-session_state_consistency</code>: Defines the session state consistency mode. <code>AUTO</code> is recommended for TAC.  </li>



<li><code>-replay_init_time</code>: Specifies the maximum time (in seconds) allowed for replay to begin. A crucial tuning parameter.  </li>



<li><code>-retention_timeout</code>: Determines how long (in seconds) the Transaction Guard commit outcome is retained.  </li>



<li><code>-drain_timeout</code>: Specifies the time (in seconds) allowed for draining active sessions during planned maintenance.  </li>



<li><code>-stopoption</code>: Defines how the service is stopped during planned maintenance (e.g., <code>IMMEDIATE</code>).  </li>



<li><code>-notification</code>: Determines if FAN events are published for the service (should be set to <code>TRUE</code>).  </li>



<li><code>-rlbgoal</code>, <code>-clbgoal</code>: Set runtime and connection-time load balancing goals.</li>



<li><code>-failoverretry</code>, <code>-failoverdelay</code>: Connection retry settings.  </li>
</ul>
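

<p>Putting these parameters together, the following is a hedged single-instance sketch of a TAC service configuration through <code>DBMS_SERVICE</code>; the attribute keys mirror the <code>srvctl</code> flags above, <code>tac_svc</code> is a placeholder, and the accepted keys should be confirmed in the <code>DBMS_SERVICE</code> documentation for your release. On RAC or other Clusterware-managed databases, use <code>srvctl modify service</code> with the flags listed above instead.</p>



<pre class="wp-block-code"><code>DECLARE
  params DBMS_SERVICE.SVC_PARAMETER_ARRAY;
BEGIN
  params('FAILOVER_TYPE')             := 'AUTO';    -- -failovertype AUTO (TAC)
  params('COMMIT_OUTCOME')            := 'true';    -- -commit_outcome TRUE
  params('FAILOVER_RESTORE')          := 'AUTO';    -- -failover_restore AUTO
  params('SESSION_STATE_CONSISTENCY') := 'AUTO';    -- -session_state_consistency AUTO
  params('REPLAY_INITIATION_TIMEOUT') := '300';     -- -replay_init_time 300 (seconds)
  params('RETENTION_TIMEOUT')         := '86400';   -- -retention_timeout 86400 (24 hours)
  params('DRAIN_TIMEOUT')             := '60';      -- -drain_timeout 60 (seconds)
  params('FAILOVER_RETRIES')          := '30';      -- -failoverretry 30
  params('FAILOVER_DELAY')            := '10';      -- -failoverdelay 10 (seconds)
  DBMS_SERVICE.MODIFY_SERVICE('tac_svc', params);
END;
/</code></pre>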



<h3 class="wp-block-heading">6.3 Application Design Considerations</h3>



<p>The application&#8217;s design and coding practices directly impact the effectiveness of AC/TAC:</p>



<ul class="wp-block-list">
<li><strong>Pooling:</strong> Using Oracle connection pools (UCP, WLS Active GridLink, OCI/ODP.NET pools) is strongly recommended. Connections should be returned to the pool immediately after use.  </li>



<li><strong>Request Boundaries:</strong> Clear request boundaries must be ensured, either implicitly via pools or explicitly via API calls if necessary.  </li>



<li><strong>Statelessness:</strong> Aim for stateless application logic between requests whenever possible. If state exists, ensure it is managed correctly (restorable via <code>FAILOVER_RESTORE</code> or callbacks for AC). The <code>RESET_STATE</code> feature in 23ai can also assist here.  </li>



<li><strong>Error Handling:</strong> Applications must still have robust error handling for unrecoverable errors not handled by AC/TAC.  </li>



<li><strong>Avoid Legacy Concrete Classes:</strong> JDBC applications should avoid using legacy concrete classes from the <code>oracle.sql.*</code> package, preferring standard JDBC or <code>oracle.jdbc.*</code> interfaces. ACCHK can detect these.  </li>



<li><strong>Mutable Functions:</strong> Understand how AC/TAC handles mutable functions like <code>SYSDATE</code>, <code>SYS_GUID</code>, and <code>sequence.NEXTVAL</code> (original values are preserved during replay). Grant <code>KEEP</code> privileges if necessary (see the grants sketch after this list).</li>
</ul>
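

<p>The <code>KEEP</code> grants mentioned in the last item can be sketched as follows; the grantee, owner, and sequence names are placeholders.</p>



<pre class="wp-block-code"><code>-- Let replayed calls keep the original mutable values (sketch).
GRANT KEEP DATE TIME TO app_user;                          -- original SYSDATE/SYSTIMESTAMP values
GRANT KEEP SYSGUID   TO app_user;                          -- original SYS_GUID results
GRANT KEEP SEQUENCE ON app_owner.order_seq TO app_user;    -- original NEXTVAL values</code></pre>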



<h3 class="wp-block-heading">6.4 Supported and Unsupported Configurations/Operations</h3>



<ul class="wp-block-list">
<li><strong>Supported:</strong> Standard SQL and PL/SQL operations with compatible drivers, pools, and database versions are generally supported. XA transactions are supported with specific replay data sources (<code>OracleXADataSourceImpl</code>).  </li>



<li><strong>Unsupported/Limitations:</strong>
<ul class="wp-block-list">
<li>Asynchronous ODP.NET.  </li>



<li>Specific drivers (JDBC Type 2 OCI, OLE DB, ODBC, OCCI, Pro*C, etc.).  </li>



<li><code>LONG</code>/<code>LONG RAW</code> data types: Using LOBs is recommended. Can cause replay failures if not handled carefully due to their streaming nature.  </li>



<li>Certain session state changes or PL/SQL calls might temporarily disable replay.  </li>



<li>Operations using legacy <code>oracle.sql.*</code> classes.  </li>
</ul>
</li>
</ul>



<p>Successful AC/TAC deployment hinges on meeting a cascade of prerequisites involving database version, client version, specific driver types, connection pooling strategy, and service configuration. It is not a plug-and-play feature. Each component (database, driver, pool, FAN) has evolved alongside AC/TAC capabilities. Using mismatched versions can lead to partial functionality or complete failure. Service parameters directly control AC/TAC behavior. Application design choices (pooling, state management) interact directly with AC/TAC mechanisms. Therefore, a comprehensive checklist approach covering all tiers is essential before implementation. Overlooking any requirement (e.g., using an unsupported driver, not configuring FAN, incorrect service settings) will likely lead to failed replays and frustration. The version requirements also imply that leveraging the latest AC/TAC features often necessitates upgrading both server and client components.</p>



<h2 class="wp-block-heading">7. Benefits, Limitations, and Considerations</h2>



<p>While Application Continuity offers significant availability advantages for applications relying on Oracle databases, it also has specific limitations and requires careful planning and configuration for successful implementation.</p>



<h3 class="wp-block-heading">7.1 Advantages of Implementing AC/TAC</h3>



<ul class="wp-block-list">
<li><strong>Outage Masking:</strong> Hides both planned maintenance (patching, configuration) and unplanned outages (instance, network, storage failures) from applications and end-users.  </li>



<li><strong>Improved User Experience:</strong> Users experience only brief delays instead of error messages during interruptions, increasing satisfaction.  </li>



<li><strong>Increased Application Availability:</strong> Minimizes downtime, ensuring business continuity.  </li>



<li><strong>Developer Productivity:</strong> Reduces the need for complex error-handling code for recoverable errors.  </li>



<li><strong>Transaction Integrity (Idempotence):</strong> Prevents duplicate transaction commits during replay, thanks to Transaction Guard.  </li>



<li><strong>TAC Transparency:</strong> TAC offers easy configuration without requiring code changes.  </li>



<li><strong>Broad Platform Support:</strong> Supports various platforms like Java, .NET, and Python, and common connection pools.</li>
</ul>



<h3 class="wp-block-heading">7.2 Potential Drawbacks, Limitations, and Common Pitfalls</h3>



<ul class="wp-block-list">
<li><strong>Recoverable Errors Only:</strong> Only handles <em>recoverable</em> database errors like network issues or instance failures. <em>Unrecoverable</em> errors originating from application logic or invalid data must still be managed by the application.  </li>



<li><strong>Compatibility Requirements:</strong> Requires specific and compatible Oracle database and client versions, dedicated drivers, and connection pools (See Section 6).  </li>



<li><strong>Configuration Complexity:</strong> Needs careful service configuration, FAN setup, and potentially application adjustments. Misconfiguration leads to replay failure.  </li>



<li><strong>Session State Management:</strong> Restoring session state correctly can be complex, especially for AC or stateful applications. Unrestorable state prevents replay.  </li>



<li><strong>Performance Impact:</strong> State tracking and potential replay can introduce some performance overhead, although usually minimal. Tuning parameters like <code>replay_init_time</code> is important.  </li>



<li><strong>Debugging Challenges:</strong> Diagnosing replay failures can be difficult, requiring tracing and tools like ACCHK.</li>



<li><strong>Non-Replayable Operations:</strong> Not all operations are replayable (e.g., those using certain legacy classes, or some complex PL/SQL and external calls if not handled carefully).</li>



<li><strong>Application Assumptions:</strong> Code assuming ROWIDs don&#8217;t change or relying on middle-tier timing might encounter issues.  </li>
</ul>



<h3 class="wp-block-heading">7.3 Understanding and Managing Side Effects</h3>



<p>Managing side effects (actions outside the database transaction: email, file writes, etc.) is a key difference between AC and TAC and requires careful consideration.</p>



<ul class="wp-block-list">
<li><strong>Default Behaviors:</strong> AC replays side effects by default, while TAC does not.  </li>



<li><strong>Strategies for AC:</strong>
<ul class="wp-block-list">
<li>If replay of the side effect is acceptable (it&#8217;s idempotent or business logic allows), the default behavior can be used.</li>



<li>To prevent unwanted replay, disable replay for specific code blocks using the <code>disableReplay()</code> API.  </li>



<li>Callbacks can be used for custom logic, though complex.</li>
</ul>
</li>



<li><strong>Strategy for TAC:</strong> Rely on automatic detection and suppression. If a side effect <em>must</em> be replayed, TAC might not be suitable, and AC should be considered.</li>



<li><strong>Analysis:</strong> Identifying side effects during application analysis is crucial.  </li>
</ul>



<p>While AC/TAC offers significant availability benefits, they are not a panacea. Their effectiveness depends heavily on the application&#8217;s architecture, the nature of the failure, and meticulous configuration and validation. The limitations make it clear that AC/TAC targets a specific class of problems. They do not fix application bugs or handle every possible state. The pitfalls around configuration and state indicate that implementation requires expertise and testing. Side effect management remains a key differentiator and potential point of complexity. Therefore, setting realistic expectations is vital. Organizations should view AC/TAC as powerful tools within a larger HA strategy, not a complete replacement for robust application design and error handling. Thorough testing and validation with tools like ACCHK are non-negotiable before production deployment.</p>



<h2 class="wp-block-heading">8. Integration with Oracle High Availability Solutions</h2>



<p>While Application Continuity is a valuable feature on its own, its power and effectiveness are maximized when integrated with Oracle&#8217;s other high availability solutions. Particularly when used alongside technologies like Oracle RAC and Active Data Guard, it provides multi-layered protection, creating a comprehensive defense against application interruptions.</p>



<h3 class="wp-block-heading">8.1 AC/TAC and Real Application Clusters (RAC)</h3>



<p>Oracle RAC allows a single database to run across multiple servers (nodes), providing instance-level redundancy and scalability. AC/TAC plays a critical role in RAC environments in the following ways:</p>



<ul class="wp-block-list">
<li><strong>Failover Target:</strong> When a RAC node or instance fails, AC/TAC directs the session to another surviving instance in the cluster and initiates the replay process.  </li>



<li><strong>Clusterware Services and FAN Integration:</strong> In RAC, database services are managed by Oracle Clusterware. These services distribute workloads and define preferred and available instances. FAN, as an integral part of Clusterware, instantly broadcasts changes in node or service status (crash, start, stop). AC/TAC uses these FAN notifications for rapid failure detection and for triggering the failover process.</li>



<li><strong>Planned Maintenance and Load Balancing:</strong> RAC services are used to manage workload redirection or draining during planned maintenance (e.g., rolling patching). AC/TAC helps ensure continuity during these processes. Furthermore, load balancing information provided via FAN allows connection pools to direct new connections to less loaded instances.  </li>



<li><strong>23ai Enhancements:</strong> Features introduced in Oracle 23ai, such as Smart Connection Rebalance and Smooth Reconfiguration of RAC Instances, further enhance overall application availability in RAC environments.  </li>
</ul>



<h3 class="wp-block-heading">8.2 AC/TAC and Active Data Guard (ADG)</h3>



<p>Oracle Active Data Guard provides disaster recovery and data protection by creating one or more synchronized physical standby copies of a primary database; the standby can also be opened for read-only queries (hence &#8220;Active&#8221;). AC/TAC is also supported in ADG environments:</p>



<ul class="wp-block-list">
<li><strong>Disaster Recovery Scenarios:</strong> When the primary database becomes completely unavailable (e.g., site disaster), Data Guard performs an automatic or manual failover or switchover to the standby database.  </li>



<li><strong>Role-Based Services:</strong> In ADG environments, services are typically defined based on a specific role (PRIMARY or STANDBY). An AC/TAC-enabled service is defined to run on the database holding the primary role (see the sketch after this list).</li>



<li><strong>Post-Failover Replay:</strong> After the role transition (failover/switchover) completes, the application service starts on the new primary database. AC/TAC directs client connections to this new primary and attempts to rebuild and replay the interrupted session and transaction there.  </li>



<li><strong>Data Loss Consideration:</strong> Crucially, if <strong>data loss occurred</strong> during the Data Guard role transition, AC/TAC will <strong>not attempt</strong> replay. Therefore, running Data Guard in Maximum Availability or Maximum Protection mode is preferred for seamless failover with AC/TAC.  </li>
</ul>
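

<p>A hedged sketch of such a role-based service definition (all names are placeholders): the <code>-role PRIMARY</code> attribute tells Clusterware to start the service only on the database that currently holds the primary role, so clients always reach the AC/TAC-enabled primary after a switchover or failover.</p>



<pre class="wp-block-code"><code># TAC-enabled service that runs only while this database holds the PRIMARY role
srvctl add service -db &lt;db_unique_name> -service GOLD_TAC \
  -role PRIMARY \
  -failovertype AUTO -commit_outcome TRUE -failover_restore AUTO \
  -notification TRUE</code></pre>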



<h3 class="wp-block-heading">8.3 Relationship with Connection Draining for Planned Maintenance</h3>



<p>The primary mechanism for achieving zero downtime during planned maintenance (e.g., database or OS patching, hardware upgrades) is <strong>connection draining</strong>.</p>



<ul class="wp-block-list">
<li><strong>Process:</strong> Services on the instance undergoing maintenance are stopped or relocated with a specific drain timeout (<code>drain_timeout</code>) using <code>srvctl relocate service</code> or <code>srvctl stop service</code> (see the sketch after this list). FAN notifies connection pools of this status. Pools stop handing out new connections to the instance being drained and close existing idle connections; active connections are closed once they finish their work and are returned to the pool.</li>



<li><strong>AC/TAC&#8217;s Role:</strong> AC/TAC acts as a <strong>backup mechanism</strong> during the draining process. If a session cannot complete its work within the defined <code>drain_timeout</code> and its connection is forcibly terminated, AC/TAC attempts to replay that session on a surviving instance. This helps ensure continuity even if the drain timeout is insufficient or for unexpectedly long-running transactions.  </li>
</ul>
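

<p>A minimal draining sketch using the same <code>srvctl</code> options referenced above (instance and service names are placeholders, and exact options vary slightly by release):</p>



<pre class="wp-block-code"><code># Drain sessions off instance1 before maintenance: wait up to 10 minutes,
# then terminate whatever remains (AC/TAC replays those sessions elsewhere)
srvctl relocate service -db &lt;db_unique_name> -service GOLD_TAC \
  -oldinst &lt;instance1> -newinst &lt;instance2> \
  -drain_timeout 600 -stopoption IMMEDIATE -force</code></pre>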



<p>AC/TAC is most potent when deployed within a comprehensive MAA framework involving RAC and/or Active Data Guard, leveraging the redundancy and failover capabilities of the underlying infrastructure. RAC provides immediate local failover targets. ADG provides disaster recovery failover targets. FAN, integral to Clusterware/Data Guard Broker, provides the necessary rapid notifications. Draining handles planned events gracefully. AC/TAC fills the gap by handling the <em>application session</em> recovery during these infrastructure events. While AC/TAC can provide some benefit on a single instance, its true value emerges in clustered or replicated environments where alternative processing resources are readily available. The synergy between these technologies (RAC/ADG + FAN + Pools + AC/TAC) creates a multi-layered defense against disruption.</p>



<h2 class="wp-block-heading">9. Enhancements in Oracle Database 23ai</h2>



<p>Oracle continues to invest in Application Continuity and related high availability technologies. The Oracle Database 23ai release introduces significant new features and improvements in this area, expanding the capabilities of AC/TAC and simplifying its usage.</p>



<h3 class="wp-block-heading">9.1 Overview of Key HA/Continuity Features in 23ai</h3>



<p>Oracle 23ai includes a range of innovations focused on enhancing continuous availability. The main enhancements directly or indirectly related to Application Continuity are:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>23ai Feature</th><th>Description</th><th>Benefit</th><th>Related Snippets</th></tr><tr><td><strong>AC Session State Restore (Database Templates)</strong></td><td>Uses templates to checkpoint/restore session state for AC replay/migration.</td><td>Simplifies/broadens AC use, improves state restore reliability, reduces planned downtime.</td><td><sup></sup></td></tr><tr><td><strong>AC Batch Support (Resumable Cursors)</strong></td><td>TAC automatically manages and allows replay of long-running cursors common in batch jobs.</td><td>Extends AC/TAC protection to batch workloads.</td><td><sup></sup></td></tr><tr><td><strong>JDBC Auto-Enabled AC</strong></td><td>AC support is enabled by default in 23ai JDBC drivers; only requires AC-enabled service.</td><td>Lowers the entry barrier for Java applications.</td><td><sup></sup></td></tr><tr><td><strong>JDBC True Cache Integration</strong></td><td>JDBC driver can route read-only workloads to True Cache instances.</td><td>Improves performance/scalability (indirectly related to replay).</td><td><sup></sup></td></tr><tr><td><strong>Database Native Transaction Guard</strong></td><td>Persists LTXID as part of commit, reducing overhead compared to separate table.</td><td>Improves TG performance, requires no client changes.</td><td><sup></sup></td></tr><tr><td><strong>Smart Connection Rebalance</strong></td><td>Automatically moves sessions between RAC instances based on performance.</td><td>Transparently improves performance/resource utilization.</td><td><sup></sup></td></tr><tr><td><strong>Smooth Reconfiguration of RAC Instances</strong></td><td>Reduces downtime when nodes join/leave a RAC cluster.</td><td>Enhances continuous availability during cluster changes.</td><td><sup></sup></td></tr><tr><td><strong>JDBC Self-Driver Diagnosability</strong></td><td>Single production JAR; dumps in-memory trace on first failure.</td><td>Simplifies debugging and diagnostics for JDBC/AC issues.</td><td><sup></sup></td></tr></tbody></table></figure>



<p>The table above summarizes the most relevant 23ai enhancements impacting application continuity and high availability.</p>



<h3 class="wp-block-heading">9.2 Database Templates for Session State Restoration</h3>



<p>Prior to Oracle 23ai, restoring complex session states, especially with AC, could require mechanisms like manual callbacks or connection labeling. <strong>Database Templates</strong>, introduced in 23ai, significantly simplify and automate this process. These templates are used to periodically checkpoint the session state, covering both server-side and client-visible aspects. When an outage occurs and replay is needed, AC uses these templates to quickly and reliably restore the session state at the start of the replay. Enhancements in the JDBC driver allow these templates to be shared across sessions and to manage state variations. This simplifies AC configuration and increases the likelihood of successful replay.</p>



<h3 class="wp-block-heading">9.3 Support for Batch Applications (Resumable Cursors)</h3>



<p>Traditionally, long-running database cursors (often used in batch jobs or reporting) could prevent AC/TAC from automatically detecting request boundaries, because the request wasn&#8217;t considered finished while the cursor was open. Oracle 23ai addresses this in TAC and the JDBC driver. With <strong>Resumable Cursors</strong> support, long-running cursors that meet certain criteria (e.g., open with no open transaction and a restorable session state) no longer prevent the implicit determination of request boundaries. This extends TAC protection to batch-type workloads that might previously have been excluded.</p>



<h3 class="wp-block-heading">9.4 JDBC Driver Enhancements (Auto-Enablement, True Cache)</h3>



<p>The Oracle 23ai JDBC driver includes significant innovations simplifying AC usage:</p>



<ul class="wp-block-list">
<li><strong>Auto-Enablement:</strong> With the 23ai driver, Application Continuity support is <strong>enabled by default</strong> in all data sources. All that&#8217;s required for an application to benefit from AC is to connect to an AC/TAC-enabled database service. To disable AC, the new <code>oracle.jdbc.enableACSupport=false</code> connection property or system property can be used (see the sketch after this list).</li>



<li><strong>True Cache Integration:</strong> True Cache is an in-memory, consistent, read-only replica of the primary database. When enabled via the <code>oracle.jdbc.useTrueCacheDriverConnection=true</code> property, the 23ai JDBC driver can automatically route read-only workloads to appropriate True Cache instances, improving performance and scalability.  </li>



<li><strong>Self-Diagnosability:</strong> To simplify debugging, the 23ai JDBC driver records critical execution state in memory and dumps this recording when an error occurs, providing valuable diagnostic information on the first occurrence of the problem. The need for separate debug and metrics JAR files has also been eliminated.  </li>
</ul>
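

<p>An illustrative sketch (the application class and JAR names are placeholders): with the 23ai driver, no replay-specific data source is required, and AC support can be switched off per JVM using the system property mentioned above.</p>



<pre class="wp-block-code"><code># Run the application with the 23ai driver; AC support is on by default
java -cp ojdbc11.jar:ucp.jar:ons.jar:app.jar com.example.App

# Explicitly disable AC support for this JVM if needed
java -Doracle.jdbc.enableACSupport=false \
     -cp ojdbc11.jar:ucp.jar:ons.jar:app.jar com.example.App</code></pre>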



<h3 class="wp-block-heading">9.5 Transaction Guard Improvements</h3>



<p>Oracle 23ai introduces <strong>Database Native Transaction Guard</strong>. This enhancement persists the LTXID and commit outcome as part of the transaction&#8217;s commit record itself, rather than in a separate table. This eliminates the extra redo generation and performance overhead associated with Transaction Guard and requires no client-side changes.</p>



<h3 class="wp-block-heading">9.6 Smart Connection Rebalance and Smooth Reconfiguration</h3>



<p>While not direct AC/TAC mechanisms, these two features enhance overall availability and performance in RAC environments, making the platform AC/TAC runs on more stable:</p>



<ul class="wp-block-list">
<li><strong>Smart Connection Rebalance:</strong> Automatically redistributes sessions across RAC instances based on real-time performance, balancing load and optimizing performance.  </li>



<li><strong>Smooth Reconfiguration:</strong> Reduces the brief service interruptions (brownouts) experienced during RAC topology changes, such as nodes joining or leaving the cluster.  </li>
</ul>



<p>Oracle 23ai represents a significant leap forward in making Application Continuity more powerful, easier to manage, and applicable to a broader range of workloads (notably batch). Features like Database Templates, Resumable Cursor support, and JDBC auto-enablement directly address previous limitations or complexities. Native Transaction Guard improves efficiency. RAC enhancements strengthen the underlying platform. For organizations considering AC/TAC, 23ai offers compelling reasons to upgrade. It lowers adoption barriers and expands protection, aligning with trends towards autonomous operations and broader HA coverage. True Cache also signals a move towards optimizing read performance in HA architectures.</p>



<h2 class="wp-block-heading">10. Configuration, Validation, and Best Practices</h2>



<p>Transforming the theoretical benefits of Application Continuity into reality requires careful configuration, thorough validation, and adherence to established best practices. Ensuring the correct settings on both the database server and client-side, along with designing the application to be compatible with AC/TAC, is critically important.</p>



<h3 class="wp-block-heading">10.1 Configuring Database Services for AC/TAC</h3>



<p>Enabling AC/TAC and controlling its behavior is done through application services managed via Oracle Clusterware (for RAC) or the <code>DBMS_SERVICE</code> package (for single instance). Key parameters are set using the <code>srvctl add service</code> or <code>srvctl modify service</code> commands, as in the examples below:</p>



<ul class="wp-block-list">
<li><strong>Example Configuration for TAC:</strong>
<pre class="wp-block-code"><code>srvctl add service -db &lt;db_unique_name> -pdb &lt;pdb_name> -service GOLD_TAC \
  -preferred &lt;instance1> -available &lt;instance2> \
  -failovertype AUTO \
  -commit_outcome TRUE \
  -failover_restore AUTO \
  -session_state_consistency AUTO \
  -replay_init_time 1800 \
  -retention 86400 \
  -notification TRUE \
  -drain_timeout 600 \
  -stopoption IMMEDIATE</code></pre>
This command uses the recommended <code>AUTO</code> settings for TAC (<code>failovertype</code>, <code>failover_restore</code>, <code>session_state_consistency</code>) and enables Transaction Guard (<code>commit_outcome=TRUE</code>). Parameters like <code>replay_init_time</code> (replay initiation timeout), <code>retention</code> (TG outcome retention time), and <code>drain_timeout</code> (draining timeout) should be tuned based on the workload.</li>



<li><strong>Example Configuration for AC:</strong>
<pre class="wp-block-code"><code>srvctl add service -db &lt;db_unique_name> -pdb &lt;pdb_name> -service GOLD_AC \
  -preferred &lt;instance1> -available &lt;instance2> \
  -failovertype TRANSACTION \
  -commit_outcome TRUE \
  -failover_restore LEVEL1 \
  -replay_init_time 1800 \
  -retention 86400 \
  -notification TRUE \
  -drain_timeout 600 \
  -stopoption IMMEDIATE</code></pre>
This command uses <code>failovertype=TRANSACTION</code> for AC and <code>failover_restore=LEVEL1</code> for basic state restoration.</li>
</ul>



<p>Careful tuning of parameters (especially <code>replay_init_time</code> and <code>drain_timeout</code>) based on workload and expected outage durations is essential.</p>
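

<p>After creating or modifying a service, the effective attributes can be checked before testing failover. A small sketch (names are placeholders):</p>



<pre class="wp-block-code"><code># Display the service attributes, including failovertype, commit_outcome,
# replay_init_time, retention, and drain_timeout
srvctl config service -db &lt;db_unique_name> -service GOLD_TAC</code></pre>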



<h3 class="wp-block-heading">10.2 Client-Side Configuration</h3>



<p>Clients must connect correctly to AC/TAC-enabled services and be configured to leverage failover mechanisms:</p>



<ul class="wp-block-list">
<li><strong>TNS Connect String:</strong> The recommended connect string should include parameters for timeouts (<code>CONNECT_TIMEOUT</code>, <code>TRANSPORT_CONNECT_TIMEOUT</code>), retry count (<code>RETRY_COUNT</code>), and retry delay (<code>RETRY_DELAY</code>). For RAC environments, SCAN addresses (<code>HOST=&lt;scan_vip></code> within <code>ADDRESS_LIST</code>) and <code>SERVICE_NAME</code> must be used; connecting by SID is not supported.
<pre class="wp-block-code"><code>Alias =
  (DESCRIPTION =
    (CONNECT_TIMEOUT=3)(RETRY_COUNT=4)(RETRY_DELAY=2)(TRANSPORT_CONNECT_TIMEOUT=3)
    (ADDRESS_LIST =
      (LOAD_BALANCE=on)
      (ADDRESS = (PROTOCOL = TCP)(HOST=&lt;scan_vip_or_host1>)(PORT=1521))
      (ADDRESS = (PROTOCOL = TCP)(HOST=&lt;host2_vip>)(PORT=1521))...
    )
    (CONNECT_DATA = (SERVICE_NAME = GOLD_TAC))
  )</code></pre></li>



<li><strong>JDBC Configuration:</strong>
<ul class="wp-block-list">
<li><strong>DataSource:</strong> For pre-23ai drivers, <code>oracle.jdbc.replay.OracleDataSourceImpl</code> (or <code>OracleXADataSourceImpl</code> for XA) must be used. In 23ai, the standard <code>OracleDataSource</code> is sufficient as AC is auto-enabled.  </li>



<li><strong>Pool Properties:</strong> In pools like UCP, properties like <code>setFastConnectionFailoverEnabled(true)</code> must be set to enable FCF.  </li>



<li><strong>JAR Files:</strong> Ensure necessary Oracle JDBC driver (e.g., <code>ojdbc11.jar</code>), UCP (<code>ucp.jar</code>), and ONS (<code>ons.jar</code>) JAR files are in the CLASSPATH.  </li>
</ul>
</li>



<li><strong>OCI/ODP.NET Configuration:</strong>
<ul class="wp-block-list">
<li><strong><code>oraaccess.xml</code>:</strong> FAN and Runtime Load Balancing (RLB) settings can be configured via this file.  </li>



<li><strong>Connection String Attributes:</strong> Properties like <code>HA Events=true</code> (enable FAN), <code>Load Balancing=true</code> (enable RLB), <code>Pooling=true</code> can be specified in the connection string.  </li>
</ul>
</li>
</ul>



<h3 class="wp-block-heading">10.3 Validating Coverage with ACCHK</h3>



<p>Simply performing destructive testing (e.g., shutting down an instance) is insufficient to confirm correct configuration and to understand the extent to which the application is protected by AC/TAC. Oracle provides the <strong>ACCHK (Application Continuity Checker)</strong> tool for this purpose.</p>



<ul class="wp-block-list">
<li><strong>Purpose:</strong> ACCHK analyzes and reports the AC/TAC protection level while a specific workload is running. It shows the protection percentage, number of protected/unprotected operations, reasons for lack of protection, and usage of legacy concrete classes.  </li>



<li><strong>Steps:</strong>
<ol class="wp-block-list">
<li><strong>Create Views:</strong> Before first use, create ACCHK views (<code>DBA_ACCHK_*</code>) and the <code>ACCHK_READ</code> role in the database using <code>execute dbms_app_cont_admin.acchk_views();</code> (<code>COMPATIBLE</code> must be >= 12.2).  </li>



<li><strong>Grant Privilege:</strong> Grant the role to users who need to read the reports: <code>GRANT ACCHK_READ TO &lt;user_name>;</code>.</li>



<li><strong>Enable Tracing:</strong> Enable AC tracing for a specific duration using <code>execute dbms_app_cont_admin.acchk_set(true, &lt;duration_seconds>);</code>. This can also be done at the database or session level with <code>ALTER SYSTEM/SESSION SET EVENTS 'trace[progint_appcont_rdbms]';</code>.  </li>



<li><strong>Run Workload:</strong> Execute the application workload to be validated while tracing is active.  </li>



<li><strong>Report/Query:</strong> After the tracing period ends (or is manually stopped with <code>acchk_set(false)</code>), query views like <code>DBA_ACCHK_EVENTS</code> and <code>DBA_ACCHK_STATISTICS</code>, or use tools like ORAchk, to analyze the protection level and details (a sample session follows this list).</li>
</ol>
</li>
</ul>
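

<p>Putting the steps together, a sample validation session might look like the following sketch (assumes SYSDBA access; the grantee <code>appqa</code> and the 30-minute tracing window are placeholders):</p>



<pre class="wp-block-code"><code>sqlplus / as sysdba &lt;&lt;'EOF'
-- One-time setup: create the ACCHK views and role (COMPATIBLE >= 12.2)
EXECUTE dbms_app_cont_admin.acchk_views
GRANT ACCHK_READ TO appqa;

-- Enable AC tracing for 30 minutes (1800 seconds)
EXECUTE dbms_app_cont_admin.acchk_set(true, 1800)
EOF

# ... run the application workload while tracing is active ...

sqlplus / as sysdba &lt;&lt;'EOF'
-- Stop tracing if still active, then review the results
EXECUTE dbms_app_cont_admin.acchk_set(false)
SELECT * FROM dba_acchk_statistics;
SELECT * FROM dba_acchk_events;
EOF</code></pre>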



<p>Using ACCHK <em>before</em> deployment and <em>after</em> application changes is critical to verify the protection level and to proactively identify potential issues.</p>



<h3 class="wp-block-heading">10.4 Recommended Best Practices for Developers and DBAs</h3>



<ul class="wp-block-list">
<li><strong>DBAs:</strong>
<ul class="wp-block-list">
<li>Use application-specific services configured with correct parameters (FAILOVERTYPE, COMMIT_OUTCOME, etc.).</li>



<li>Enable FAN and ensure client accessibility (e.g., port 6200).</li>



<li>Monitor AC statistics (AWR, ACCHK views).  </li>



<li>Tune timeouts (<code>replay_init_time</code>, <code>drain_timeout</code>) based on workload.  </li>



<li>Apply recommended database patches.  </li>
</ul>
</li>



<li><strong>Developers:</strong>
<ul class="wp-block-list">
<li>Use Oracle connection pools and return connections promptly after use.  </li>



<li>Write stateless code where possible, or manage state carefully.  </li>



<li>Implement robust error handling for unrecoverable errors.  </li>



<li>Avoid legacy <code>oracle.sql.*</code> concrete classes.  </li>



<li>Understand and manage side effects (consider <code>disableReplay</code> for AC or TAC&#8217;s default behavior).  </li>



<li>Validate coverage with ACCHK.  </li>
</ul>
</li>



<li><strong>General:</strong>
<ul class="wp-block-list">
<li>Perform thorough testing in QA environments that mimic production.  </li>



<li>Use the latest drivers and patches.  </li>



<li>Follow Oracle MAA guidelines.  </li>
</ul>
</li>
</ul>



<p>Configuration and validation are iterative processes, not one-time tasks. Tuning and verification are essential to realize the full benefits of AC/TAC. Default settings may not be optimal. Application behavior can change, potentially introducing unprotected calls. ACCHK provides the necessary visibility into actual protection levels, allowing informed adjustments and fixes <em>before</em> a real outage reveals weaknesses. Successful AC/TAC implementation requires ongoing diligence, involving not just initial setup but also performance monitoring, coverage validation, and potentially adjusting configuration or application code based on findings. Collaboration between DBAs and developers is key.</p>



<h2 class="wp-block-heading">11. Conclusion</h2>



<p>Oracle Application Continuity (AC) and its more transparent variant, Transparent Application Continuity (TAC), offer powerful solutions to the challenges of uninterrupted service and high availability faced by modern applications. These technologies mask the impact of planned and unplanned database outages from end-users and applications, aiming to make disruptions feel like nothing more than brief processing delays.</p>



<p>AC and TAC function by recovering interrupted database sessions, including their state and in-flight transactions, and automatically replaying them. Transaction Guard plays a crucial role in this process, guaranteeing that transactions are executed only once, thereby preserving data integrity. Fast Application Notification (FAN), together with compatible connection pools and drivers, forms the critical ecosystem enabling rapid failure detection and seamless transition. While TAC provides transparent protection, often without code changes, AC offers greater customization and control.</p>



<p>These technologies are particularly potent when deployed alongside other MAA components like Oracle RAC and Active Data Guard, creating a highly resilient and continuously available infrastructure. Innovations in Oracle 23ai, such as Database Templates, enhanced support for batch jobs, and auto-enablement in the JDBC driver, further advance the capabilities of AC/TAC and ease their adoption.</p>



<p>However, fully leveraging the benefits of AC/TAC requires more than just enabling a feature. Success depends on selecting the correct database and client versions, meticulous service configuration, designing applications according to best practices, and performing thorough validation using tools like ACCHK. Organizations should view AC/TAC as powerful tools within the MAA framework that complement robust application design and overall error management strategies. With proper configuration, continuous monitoring, and adherence to best practices, Oracle Application Continuity can play a critical role in achieving the level of application availability demanded by today&#8217;s challenging business requirements.</p>



<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/oracle-application-continuity-ac-tac.html">Oracle Application Continuity (AC &amp; TAC)</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Exadata Update Utilities: patchmgr and dbnodeupdate.sh</title>
		<link>https://www.bugraparlayan.com.tr/exadata-update-utilities-patchmgr-and-dbnodeupdate-sh.html</link>
		
		<dc:creator><![CDATA[Bugra Parlayan]]></dc:creator>
		<pubDate>Sat, 26 Apr 2025 14:56:09 +0000</pubDate>
				<category><![CDATA[Engineered Systems]]></category>
		<category><![CDATA[cell node patching]]></category>
		<category><![CDATA[database node updates]]></category>
		<category><![CDATA[dbnodeupdate.sh]]></category>
		<category><![CDATA[dbnodeupdate.sh usage]]></category>
		<category><![CDATA[Exadata automation tools]]></category>
		<category><![CDATA[Oracle Exadata patching]]></category>
		<category><![CDATA[patchmgr]]></category>
		<category><![CDATA[patchmgr best practices]]></category>
		<category><![CDATA[rolling patch updates]]></category>
		<guid isPermaLink="false">https://www.bugraparlayan.com.tr/?p=1459</guid>

					<description><![CDATA[<p>1. Introduction Oracle Exadata Database Machine is a high-performance, optimized platform for Oracle Database workloads. Regularly updating the software components of this platform—including the operating system, Exadata system software, device drivers, and firmware—is crucial for addressing security vulnerabilities, fixing bugs, and leveraging new features. Oracle provides specialized utilities to manage these update processes. Two of &#8230;</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/exadata-update-utilities-patchmgr-and-dbnodeupdate-sh.html">Exadata Update Utilities: patchmgr and dbnodeupdate.sh</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">1. Introduction</h2>



<p>Oracle Exadata Database Machine is a high-performance, optimized platform for Oracle Database workloads. Regularly updating the software components of this platform—including the operating system, Exadata system software, device drivers, and firmware—is crucial for addressing security vulnerabilities, fixing bugs, and leveraging new features. Oracle provides specialized utilities to manage these update processes. Two of the most commonly used tools are <code>patchmgr</code> and <code>dbnodeupdate.sh</code>. This document aims to provide a detailed technical comparison of these two utilities, explaining their functions, key differences, use cases, and parameters for effective <strong>Exadata patching</strong>.</p>



<h2 class="wp-block-heading">2. The <code>patchmgr</code> Utility</h2>



<h3 class="wp-block-heading">2.1. Definition and Purpose</h3>



<p><code>patchmgr</code> is a centralized utility designed to <strong>orchestrate and simplify</strong> software updates for Oracle Exadata infrastructure components. It allows administrators to update multiple components—database servers, storage servers, and network switches—using a single command structure, streamlining the <strong>Exadata update process</strong>.</p>



<h3 class="wp-block-heading">2.2. Scope and Capabilities</h3>



<ul class="wp-block-list">
<li><strong>Broad Component Coverage:</strong> <code>patchmgr</code> can update various Exadata components:
<ul class="wp-block-list">
<li>Oracle Exadata Storage Servers (Cells)</li>



<li>Oracle Exadata Database Servers (Compute Nodes)</li>



<li>RDMA Network Fabric Switches (RoCE Switches)</li>



<li>InfiniBand Network Fabric Switches</li>



<li>Management Network Switch (specific models)   </li>
</ul>
</li>



<li><strong>Orchestration:</strong> It manages the update sequence across multiple targets, supporting both rolling and non-rolling updates.
<ul class="wp-block-list">
<li><strong>Rolling Update:</strong> Updates components sequentially (one by one) to maintain overall system availability, ideal for RAC clusters or storage server grids.  </li>



<li><strong>Non-Rolling Update:</strong> Updates all specified components concurrently, which is faster but requires a complete system outage.  </li>
</ul>
</li>



<li><strong>Automation:</strong> For database server updates, <code>patchmgr</code> automates numerous steps, including stopping/starting databases and Grid Infrastructure, managing VMs, handling Oracle Enterprise Manager agents, taking OS backups, relinking Oracle Homes, and applying best practice configurations.  </li>



<li><strong>Centralized Execution:</strong> <code>patchmgr</code> can be executed from a database server within the Exadata system being patched or from a separate, central server (the &#8220;driving system&#8221;) running Oracle Linux or Oracle Solaris. This facilitates managing multiple Exadata systems from one location.  </li>



<li><strong>User and Concurrency:</strong> Can be run by <code>root</code> or a non-root user (requires <code>-log_dir</code>). Multiple <code>patchmgr</code> instances can run concurrently from the same software directory (using distinct <code>-log_dir</code> values) to patch different systems simultaneously.  </li>
</ul>



<h3 class="wp-block-heading">2.3. Platform Support</h3>



<ul class="wp-block-list">
<li><strong>Target Systems:</strong> The Exadata components updated by <code>patchmgr</code> (database servers, storage servers) typically run <strong>Oracle Linux</strong>.</li>



<li><strong>Driving System:</strong> The <code>patchmgr</code> utility itself can be executed from a server running either <strong>Oracle Linux</strong> or <strong>Oracle Solaris</strong>. This means you can initiate and manage the patching of Linux-based Exadata components from a Solaris management server.  </li>
</ul>



<h3 class="wp-block-heading">2.4. Key Parameters and Usage</h3>



<p>The general syntax for <code>patchmgr</code> is as follows; a worked example appears after the parameter list below.</p>



<pre class="wp-block-code"><code>./patchmgr -&lt;component&gt; &lt;component_list_file&gt; -&lt;action&gt; &lt;required_arguments&gt; [optional_arguments]</code></pre>



<ul class="wp-block-list">
<li><strong><code>-&lt;component></code>:</strong> Specifies the component type:
<ul class="wp-block-list">
<li><code>-cells</code>: For Storage Servers.</li>



<li><code>-dbnodes</code>: For Database Servers.</li>



<li><code>-ibswitches</code>: For InfiniBand Switches.</li>



<li><code>-roceswitches</code>: For RoCE Switches.</li>
</ul>
</li>



<li><strong><code>&lt;component_list_file></code>:</strong> A text file listing the hostnames of the components to be updated.  </li>



<li><strong><code>-&lt;action></code>:</strong> Specifies the operation:
<ul class="wp-block-list">
<li><code>-upgrade</code>: Performs a software upgrade to a specified version.  </li>



<li><code>-rollback</code>: Rolls back to the previous software version.</li>



<li><code>-precheck</code>: Runs prerequisite checks before an upgrade or rollback.  </li>



<li><code>-backup</code>: Performs a backup (typically for DB nodes).</li>
</ul>
</li>



<li><strong><code>[optional_arguments]</code>:</strong> Modifies the action or behavior:
<ul class="wp-block-list">
<li><code>--rolling</code> / <code>-rolling</code>: Perform the action in a rolling fashion.  </li>



<li><code>--iso_repo &lt;path_to_iso></code> / <code>-iso_repo &lt;path_to_iso></code>: Specifies the path to the patch ISO file.  </li>



<li><code>--target_version &lt;version></code> / <code>-target_version &lt;version></code>: Specifies the target software version.  </li>



<li><code>--modify_at_prereq</code>: Allows removal of conflicting RPMs during precheck to resolve dependencies (for DB nodes).  </li>



<li><code>--force_remove_custom_rpms</code>: Forces removal of custom (non-Exadata) RPMs during an OS upgrade (for DB nodes).</li>



<li><code>--log_dir &lt;directory|auto></code>: Specifies a directory for log files or uses automatic naming. Required for concurrent execution or non-root users.  </li>



<li><code>--allow_active_network_mounts</code>: Allows patching to proceed even if active network mounts (like NFS) are detected.  </li>



<li><code>--dbnode_patch_base &lt;path></code>: Specifies the directory on target DB nodes where patch files will be extracted.  </li>



<li><code>--ignore_alerts</code>: Proceeds with patching despite active hardware alerts.  </li>



<li><code>--sequential_backup</code>: Backs up each node immediately before updating it during a rolling update (default is to back up all nodes first).  </li>



<li><code>--update_type &lt;type></code>: Selects security-only (<code>allcvss</code>) or full (<code>full</code>) update.  </li>



<li><code>--live-update-target</code>: Utilizes Exadata Live Update (on supported versions).  </li>
</ul>
</li>
</ul>
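

<p>A worked sketch using the options above (the host list file, ISO path, and target version are placeholders, and exact flags can vary between <code>patchmgr</code> releases):</p>



<pre class="wp-block-code"><code># dbs_group lists one database server hostname per line

# 1. Prerequisite checks only
./patchmgr -dbnodes dbs_group -precheck \
  -iso_repo /u01/stage/&lt;exadata_dbserver_iso>.zip \
  -target_version &lt;target_version> -log_dir auto

# 2. Rolling upgrade once the prechecks pass
./patchmgr -dbnodes dbs_group -upgrade -rolling \
  -iso_repo /u01/stage/&lt;exadata_dbserver_iso>.zip \
  -target_version &lt;target_version> -log_dir auto</code></pre>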



<h2 class="wp-block-heading">3. The <code>dbnodeupdate.sh</code> Utility</h2>



<h3 class="wp-block-heading">3.1. Definition and Purpose</h3>



<p><code>dbnodeupdate.sh</code> is a shell script specifically used to update, roll back, or back up the software on a <strong>single Oracle Exadata database server (compute node)</strong>. Before <code>patchmgr</code> provided orchestration for database servers, updates often involved manually running <code>dbnodeupdate.sh</code> on each node sequentially.</p>



<h3 class="wp-block-heading">3.2. Scope and Capabilities</h3>



<ul class="wp-block-list">
<li><strong>Single Database Server Focus:</strong> <code>dbnodeupdate.sh</code> operates only on the database server where it is executed.  </li>



<li><strong>Core Update Engine:</strong> When <code>patchmgr</code> updates database servers, it essentially invokes <code>dbnodeupdate.sh</code> on each target node to perform the actual update, rollback, or backup tasks.  </li>



<li><strong>Operating System Updates:</strong> It handles updates for the Oracle Linux OS, device drivers, and firmware included in the Exadata system software patch. It supports major OS upgrades (e.g., OL6 to OL7, OL7 to OL8), although older <code>dbnodeupdate.sh</code> versions might be needed for older OS transitions.  </li>



<li><strong>Backup and Rollback:</strong> Automatically creates a backup of the root filesystem before starting an update. This backup enables rollback to the previous state using the <code>-r</code> option if the update fails or needs to be reverted.  </li>



<li><strong>Dependency Management:</strong> Includes the <code>-M</code> option to allow the removal of conflicting RPMs during the prerequisite check phase to resolve dependency issues.  </li>
</ul>



<h3 class="wp-block-heading">3.3. Platform Support</h3>



<ul class="wp-block-list">
<li><code>dbnodeupdate.sh</code> runs <em>on</em> the Exadata database server being updated. Modern Exadata database servers exclusively use <strong>Oracle Linux</strong>. While Solaris was supported on older Exadata compute nodes, <code>dbnodeupdate.sh</code> in current contexts targets Linux.</li>
</ul>



<h3 class="wp-block-heading">3.4. Key Parameters and Usage</h3>



<p>The basic usage is <code>./dbnodeupdate.sh &lt;options&gt;</code>, with the key options listed below; a combined example follows the list.</p>



<ul class="wp-block-list">
<li><strong><code>-u</code>:</strong> Initiates the update process.  </li>



<li><strong><code>-r</code>:</strong> Initiates a rollback from the pre-update backup.  </li>



<li><strong><code>-b</code>:</strong> Performs only the backup step.</li>



<li><strong><code>-c</code>:</strong> Runs the post-reboot completion steps after an update or rollback.  </li>



<li><strong><code>-l &lt;ISO_or_Repo_URL></code>:</strong> Specifies the path to the update ISO file or the YUM repository URL.  </li>



<li><strong><code>-s</code>:</strong> Stops Cluster Ready Services (CRS) before the update/rollback and restarts it afterward.  </li>



<li><strong><code>-p</code>:</strong> Runs the bootstrap phase (typically for upgrades from older versions).  </li>



<li><strong><code>-x &lt;helper_script_dir></code>:</strong> Specifies the directory containing helper scripts (often used with <code>-p</code>).  </li>



<li><strong><code>-M</code>:</strong> Allows removal of conflicting RPMs during prerequisite checks.  </li>



<li><strong><code>-v</code>:</strong> Performs only the prerequisite check.  </li>
</ul>
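

<p>A typical single-node cycle built from the options above might look like this sketch (the repository path is a placeholder):</p>



<pre class="wp-block-code"><code># 1. Prerequisite check only
./dbnodeupdate.sh -u -l /u01/stage/&lt;exadata_iso>.zip -v

# 2. Update; -s stops CRS beforehand and restarts it afterwards
./dbnodeupdate.sh -u -l /u01/stage/&lt;exadata_iso>.zip -s

# 3. After the post-update reboot, run the completion steps
./dbnodeupdate.sh -c

# If the update must be reverted, roll back from the pre-update backup
./dbnodeupdate.sh -r</code></pre>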



<h2 class="wp-block-heading">4. <code>patchmgr</code> vs. <code>dbnodeupdate.sh</code>: Key Differences</h2>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Feature</th><th><code>patchmgr</code></th><th><code>dbnodeupdate.sh</code></th></tr><tr><td><strong>Primary Purpose</strong></td><td><strong>Orchestration</strong> for Exadata infrastructure components</td><td><strong>Single</strong> Exadata DB server update/rollback/backup</td></tr><tr><td><strong>Scope</strong></td><td>DB Servers, Storage Servers, Network Switches</td><td>DB Servers Only</td></tr><tr><td><strong>Execution Mode</strong></td><td>Manages multiple targets; invokes <code>dbnodeupdate.sh</code> for DB nodes</td><td>Runs on a single target node</td></tr><tr><td><strong>Update Mode</strong></td><td>Rolling or Non-Rolling</td><td>Affects only the single node it runs on</td></tr><tr><td><strong>Run Location</strong></td><td>DB Node or separate Linux/Solaris server</td><td>Only on the target DB Node being updated</td></tr><tr><td><strong>Typical Use Case</strong></td><td>Standard multi-component/multi-node patching</td><td>Single node update/rollback; invoked by <code>patchmgr</code>; patching last node in rolling update; recovery scenarios</td></tr><tr><td><strong>Driving Platform</strong></td><td>Linux, <strong>Solaris</strong> <sup></sup></td><td>Linux (on the target DB Node)</td></tr></tbody></table></figure>



<h2 class="wp-block-heading">5. Which Tool Should Be Used When?</h2>



<ul class="wp-block-list">
<li><strong>Use <code>patchmgr</code> when:</strong>
<ul class="wp-block-list">
<li>Updating multiple storage servers, database servers, or network switches (<strong>standard and preferred method</strong>).</li>



<li>Choosing between rolling or non-rolling update strategies.</li>



<li>Managing the update process from a central server.</li>



<li>Leveraging automated orchestration steps (service stop/start, backup, relink).</li>
</ul>
</li>



<li><strong>Use <code>dbnodeupdate.sh</code> when:</strong>
<ul class="wp-block-list">
<li>Updating or rolling back <strong>only one</strong> specific database server (e.g., testing on a single node).</li>



<li>Performing a rolling update with <code>patchmgr</code> and needing to patch the <strong>initial driving node last</strong> (by running <code>dbnodeupdate.sh</code> locally on it after other nodes are done).  </li>



<li>Recovering a system after a failed update/rollback by booting from a <strong>diagnostic ISO</strong>.  </li>



<li>As a manual alternative if <code>patchmgr</code> itself encounters issues or is unavailable.</li>
</ul>
</li>
</ul>



<p>As a general rule, <strong><code>patchmgr</code> is the standard utility</strong> for routine Exadata infrastructure patching. <code>dbnodeupdate.sh</code> should be considered an underlying component used by <code>patchmgr</code> for database nodes or a tool for specific single-node scenarios.</p>



<h2 class="wp-block-heading">6. Conclusion</h2>



<p><code>patchmgr</code> and <code>dbnodeupdate.sh</code> are essential tools for maintaining the currency and security of Oracle Exadata platforms. <code>patchmgr</code> serves as the primary orchestration utility, simplifying the update process across multiple Exadata components (database servers, storage servers, switches) and supporting both rolling and non-rolling strategies. It can be driven from Linux or Solaris systems. <code>dbnodeupdate.sh</code> is the core script that performs the actual update, rollback, or backup on individual Linux-based Exadata database servers, often invoked by <code>patchmgr</code> but also usable standalone for specific single-node tasks or recovery situations. Understanding the distinct roles and capabilities of each tool allows administrators to choose the appropriate method for their specific Exadata maintenance requirements, with <code>patchmgr</code> being the standard choice for most patching operations.</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/exadata-update-utilities-patchmgr-and-dbnodeupdate-sh.html">Exadata Update Utilities: patchmgr and dbnodeupdate.sh</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Comprehensive Guide to Oracle Exadata Automatic Hard Disk Scrubbing</title>
		<link>https://www.bugraparlayan.com.tr/comprehensive-guide-to-oracle-exadata-automatic-hard-disk-scrubbing.html</link>
		
		<dc:creator><![CDATA[Bugra Parlayan]]></dc:creator>
		<pubDate>Fri, 25 Apr 2025 16:48:27 +0000</pubDate>
				<category><![CDATA[Engineered Systems]]></category>
		<category><![CDATA[DB_BLOCK_CHECKING Exadata]]></category>
		<category><![CDATA[Exadata cell offload]]></category>
		<category><![CDATA[Exadata data protection]]></category>
		<category><![CDATA[Exadata data validation]]></category>
		<category><![CDATA[Exadata disk scrubbing]]></category>
		<category><![CDATA[Exadata scrubbing]]></category>
		<category><![CDATA[Exadata silent corruption]]></category>
		<category><![CDATA[Oracle HCC scrubbing]]></category>
		<category><![CDATA[smart scan scrubbing]]></category>
		<category><![CDATA[storage level data integrity]]></category>
		<guid isPermaLink="false">https://www.bugraparlayan.com.tr/?p=1455</guid>

					<description><![CDATA[<p>I. Introduction: Overview of the Exadata Hard Disk Scrubbing Process Data integrity is a cornerstone of modern computing systems. Errors that may occur during the storage, reading, transmission, and processing of data can have devastating effects on business processes. Various error detection and correction mechanisms have been developed to mitigate these risks. One such mechanism &#8230;</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/comprehensive-guide-to-oracle-exadata-automatic-hard-disk-scrubbing.html">Comprehensive Guide to Oracle Exadata Automatic Hard Disk Scrubbing</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></description>
										<content:encoded><![CDATA[
<h2 class="wp-block-heading">I. Introduction: Overview of the Exadata Hard Disk Scrubbing Process</h2>



<p>Data integrity is a cornerstone of modern computing systems. Errors that may occur during the storage, reading, transmission, and processing of data can have devastating effects on business processes. Various error detection and correction mechanisms have been developed to mitigate these risks. One such mechanism is the &#8220;data scrubbing&#8221; process.</p>



<h3 class="wp-block-heading">A. Data Scrubbing: General Concept</h3>



<p>Data scrubbing is an error correction technique that periodically inspects storage devices or main memory for errors and corrects detected errors using redundant data, such as checksums or backup copies of the data. Its primary purpose is to reduce the likelihood that single, correctable errors will accumulate over time and lead to uncorrectable errors. This ensures data integrity and minimizes the risk of data loss.</p>



<p>This technique is a widely used error detection and correction mechanism in memory modules (with ECC memory), RAID arrays, modern file systems like ZFS and Btrfs, and FPGAs. For example, a RAID controller can periodically read all hard disks in a RAID array to detect and repair bad blocks before applications access them, thereby reducing the probability of silent data corruption caused by bit-level errors.</p>



<h3 class="wp-block-heading">B. Exadata Automatic Hard Disk Scrubbing: Definition and Scope</h3>



<p>The Oracle Exadata platform employs a multi-layered approach to ensure data integrity. One of these layers is the <strong>Exadata Automatic Hard Disk Scrub and Repair</strong> feature. As part of the Exadata System Software (Cell Software), this feature automatically and periodically inspects the <strong>hard disk drives (HDDs)</strong> within the Storage Servers (Cells) when the disks are idle.</p>



<p>The primary goal of this process is to proactively detect and facilitate the repair of bad sectors or other physical/logical defects on the disks <em>before</em> applications attempt to access the affected data. This prevents &#8220;latent&#8221; or silent data corruption.</p>



<p>The scope of Exadata scrubbing is important. This feature primarily targets <strong>physical</strong> bad sectors on hard disks. It focuses on detecting physical media errors that might be missed by standard drive Error Correcting Code (ECC) mechanisms or operating system checks. This complements, but does not replace, higher-level logical consistency checks performed by the database (e.g., via the <code>DB_BLOCK_CHECKING</code> parameter) or the manually executable ASM disk scrubbing process. Furthermore, this automatic scrubbing process does not apply to Flash drives in Exadata; these drives are protected by different mechanisms.</p>



<p>A distinctive aspect of Exadata scrubbing is its proactive nature. While database block checks typically occur during I/O operations, Exadata scrubbing specifically targets data that has <strong>not been accessed for a long time</strong>, especially when disks are idle. This approach ensures that corruption in rarely used data is detected and repaired long before it can cause an access error at a critical moment.</p>



<h3 class="wp-block-heading">C. Differences Between Exadata Hard Disk Scrubbing and ASM Disk Scrubbing</h3>



<p>The term &#8220;scrubbing&#8221; can be used in different contexts within the Oracle ecosystem, so it&#8217;s crucial to distinguish Exadata&#8217;s automatic hard disk scrubbing from the disk scrubbing feature offered by Oracle Automatic Storage Management (ASM).</p>



<ul class="wp-block-list">
<li><strong>Exadata Automatic Hard Disk Scrubbing:</strong>
<ul class="wp-block-list">
<li><strong>Scope:</strong> Operates at the Exadata Storage Server (Cell) level, managed by the Cell Software.  </li>



<li><strong>Focus:</strong> Checks the integrity of <strong>physical</strong> sectors on hard disks.  </li>



<li><strong>Operation:</strong> Runs automatically based on a schedule configured in CellCLI.  </li>



<li><strong>Resource Usage:</strong> The checking process is local to the storage cell, consuming no CPU on database servers and generating no unnecessary network traffic during the check.  </li>



<li><strong>Monitoring:</strong> Monitored via CellCLI metrics and Cell alert logs.  </li>
</ul>
</li>



<li><strong>ASM Disk Scrubbing:</strong>
<ul class="wp-block-list">
<li><strong>Scope:</strong> Operates at the ASM disk group or file level, managed by ASM.  </li>



<li><strong>Focus:</strong> Searches for <strong>logical</strong> corruptions within ASM blocks/extents.  </li>



<li><strong>Operation:</strong> Typically triggered manually (via SQL*Plus or asmcmd) or through a script (e.g., a cron job).</li>



<li><strong>Resource Usage:</strong> The process occurs at the ASM layer and can potentially consume database server resources and inter-cell network traffic.</li>



<li><strong>Monitoring:</strong> Monitored via the <code>V$ASM_OPERATION</code> view and ASM alert logs (<code>alert_+ASM.log</code>).  </li>
</ul>
</li>
</ul>



<p>These two mechanisms are complementary. Exadata scrubbing finds physical errors, potentially preventing them from causing logical corruptions later, while ASM scrubbing can find logical inconsistencies that might arise from sources other than physical media errors (e.g., software bugs). Oracle documentation suggests that, given the automatic Exadata scrubbing available in Exadata 11.2.3.3 and later, periodic ASM disk scrubbing becomes less critical for the specific purpose of <em>proactive physical/latent error checking</em>. However, manual ASM scrubbing retains its value for on-demand logical validation of specific files or disk groups.</p>
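

<p>For on-demand logical validation, ASM scrubbing can be invoked manually. A sketch (the <code>DATA</code> disk group name is a placeholder; run on the ASM instance as SYSASM):</p>



<pre class="wp-block-code"><code>sqlplus / as sysasm &lt;&lt;'EOF'
-- Logically validate the DATA disk group at low I/O priority,
-- repairing corrupt blocks from mirror copies where possible
ALTER DISKGROUP DATA SCRUB POWER LOW;

-- Track progress while the scrub runs
SELECT group_number, operation, state, power FROM v$asm_operation;
EOF</code></pre>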



<h2 class="wp-block-heading">II. Internal Mechanism of the Exadata Scrubbing Process</h2>



<p>The effectiveness of the Exadata Automatic Hard Disk Scrubbing process relies on the tight integration between the core components of the Exadata architecture: the Storage Servers (Cells) and Oracle Automatic Storage Management (ASM).</p>



<h3 class="wp-block-heading">A. Role of Exadata Storage Servers (Cells)</h3>



<p>The scrubbing process is executed by the Exadata System Software (specifically, the Cell Services &#8211; CELLSRV &#8211; process) running on each Exadata Storage Server (Cell). The inspection is local to the cell where the scanned disk resides; data is not sent outside the cell during the sector check phase, which minimizes inter-cell network traffic for the inspection stage.</p>



<p>The Cell Software continuously monitors disk health and I/O utilization to determine when to start, pause, or throttle the scrubbing process. Typically, scrubbing begins or resumes when the average disk I/O utilization drops below a certain threshold (often cited as 25%).</p>



<h3 class="wp-block-heading">B. Interaction with Oracle ASM for Detection and Repair</h3>



<p>When the Exadata scrubbing process detects a bad sector on a hard disk, the procedure unfolds as follows:</p>



<ol class="wp-block-list">
<li><strong>Detection:</strong> The Cell Software identifies a physical read error or inconsistency during its periodic scan.</li>



<li><strong>Request Submission:</strong> The Cell Software that detected the faulty sector automatically sends a repair request to the Oracle ASM instance managing the disk group containing that disk.  </li>



<li><strong>Repair by ASM:</strong> Upon receiving the request, ASM orchestrates the repair by reading a healthy copy of the data block (extent) containing the bad sector from another storage server where a mirrored copy resides.  </li>
</ol>



<p>This interaction exemplifies Exadata&#8217;s &#8220;Intelligent Storage&#8221; philosophy; low-level physical error detection happens within the cell, while ASM, which understands the database structure and data placement, coordinates the logical repair.</p>



<h3 class="wp-block-heading">C. Leveraging ASM Mirroring for Data Recovery</h3>



<p>Oracle ASM mirroring (Normal or High Redundancy) is fundamental to Exadata&#8217;s data protection strategy, and the repair capability of the scrubbing process is entirely dependent on this mechanism.</p>



<p>ASM distributes redundant copies (extents) of data blocks across different failure groups (which in Exadata are typically the Storage Servers). This ensures data accessibility even if an entire cell becomes unavailable, as data can be accessed from other copies.</p>



<p>When ASM receives a repair request triggered by scrubbing, it follows these steps:</p>



<ol class="wp-block-list">
<li><strong>Locate Healthy Copy:</strong> ASM identifies a disk on a different storage cell that holds a valid copy of the affected data block. ASM knows which disks are &#8220;partners&#8221; and where mirrored copies are stored.  </li>



<li><strong>Read Data:</strong> ASM reads the correct data from the disk containing the healthy copy.</li>



<li><strong>Write Over Bad Sector:</strong> ASM uses the correct data read to overwrite the bad sector on the original disk, thus correcting the error.  </li>
</ol>



<p>The success of this repair mechanism hinges entirely on the existence of valid and accessible ASM mirrors. If a second disk failure occurs in a Normal Redundancy (2 copies) disk group before a rebalance completes, or if all three copies become inaccessible simultaneously in a High Redundancy (3 copies) group, scrubbing can <em>detect</em> the error, but ASM <em>cannot repair</em> it. This underscores why <strong>High Redundancy</strong> is strongly recommended for critical systems, as the extra copy significantly reduces the probability of losing all copies concurrently.</p>



<p>Furthermore, the scrubbing process not only repairs isolated bad sectors but can also serve as an early indicator of more severe disk problems. If numerous or persistent errors are detected during scrubbing, ASM may take the corresponding grid disk offline and initiate a rebalance operation to redistribute data onto the remaining healthy disks. In this context, scrubbing also acts as an early warning system that triggers ASM&#8217;s existing high availability (HA) mechanisms. Monitoring the <code>V$ASM_OPERATION</code> view during or after scrub periods is important for tracking such ASM recovery actions, as sketched below.</p>
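

<p>A minimal monitoring sketch for such follow-on ASM activity (run against the ASM instance):</p>



<pre class="wp-block-code"><code>sqlplus / as sysasm &lt;&lt;'EOF'
-- Long-running ASM operations (e.g., a REBAL triggered after disk problems)
SELECT group_number, operation, state, power, sofar, est_work, est_minutes
  FROM v$asm_operation;
EOF</code></pre>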



<h3 class="wp-block-heading">D. Types of Errors Detected</h3>



<p>Exadata Automatic Hard Disk Scrubbing primarily focuses on detecting <strong>physical bad sectors</strong> and <strong>latent media errors</strong> on hard disk drives that might not be caught by standard drive ECC or operating system checks. Damaged or worn-out sectors and other physical defects fall under this scope.</p>



<p>The &#8220;logical defects&#8221; mentioned above typically refer to low-level media inconsistencies rather than logical corruptions at the ASM or database level (which are the domain of ASM scrubbing). The main goal is to find such issues before they impact data access or lead to silent data corruption.</p>



<h2 class="wp-block-heading">III. Managing and Monitoring the Exadata Scrubbing Process</h2>



<p>Effectively utilizing the Exadata Automatic Hard Disk Scrubbing feature requires proper configuration and continuous monitoring. The primary tool for these tasks is the CellCLI (Cell Command Line Interface) utility.</p>



<h3 class="wp-block-heading">A. CellCLI Commands for Configuration</h3>



<p>CellCLI is the main command-line interface for managing Exadata storage server features. Scrubbing-related configuration is done using the <code>ALTER CELL</code> command and the following attributes:</p>



<ul class="wp-block-list">
<li><strong><code>hardDiskScrubInterval</code></strong>: Determines how often the automatic scrubbing process runs. Valid options are:
<ul class="wp-block-list">
<li><code>daily</code>: Every day</li>



<li><code>weekly</code>: Every week</li>



<li><code>biweekly</code>: Every two weeks (default)</li>



<li><code>none</code>: Disables automatic scrubbing and stops any running process.</li>



<li><em>Example:</em> To set weekly scrubbing: <code>CellCLI> ALTER CELL hardDiskScrubInterval=weekly</code>.  </li>
</ul>
</li>



<li><strong><code>hardDiskScrubStartTime</code></strong>: Sets when the next scheduled scrubbing process will start. Valid options are:
<ul class="wp-block-list">
<li>A specific date and time (e.g., in &#8216;YYYY-MM-DDTHH:MI:SS-TZ&#8217; format).</li>



<li><code>now</code>: Triggers the next scrubbing cycle to start immediately (after the current cycle finishes, or for the first run).</li>



<li><em>Example:</em> To start at a specific time: <code>CellCLI> ALTER CELL hardDiskScrubStartTime='2024-10-26T02:00:00-07:00'</code>.  </li>
</ul>
</li>
</ul>



<p>To view the current scrubbing settings, use the command: <code>CellCLI&gt; LIST CELL ATTRIBUTES hardDiskScrubInterval, hardDiskScrubStartTime</code></p>



<p>Note that configuration is done on a per-cell basis, meaning these settings apply to all hard disks within a specific storage server. However, the &#8220;Adaptive Scrubbing Schedule&#8221; feature can automatically adjust the <em>effective</em> run frequency for specific disks identified as problematic, although the base schedule is configured cell-wide.</p>
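

<p>Because the settings are per cell, a utility like <code>dcli</code> can apply them uniformly across all storage servers. A sketch, assuming a <code>cell_group</code> file listing the cell hostnames and SSH user equivalence for root:</p>



<pre class="wp-block-code"><code># Set a weekly scrub interval on every cell listed in cell_group
dcli -g cell_group -l root "cellcli -e ALTER CELL hardDiskScrubInterval=weekly"

# Confirm the setting everywhere
dcli -g cell_group -l root \
  "cellcli -e LIST CELL ATTRIBUTES hardDiskScrubInterval, hardDiskScrubStartTime"</code></pre>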



<h3 class="wp-block-heading">B. Monitoring Scrubbing Activity</h3>



<p>Several methods are available to understand the status and impact of the scrubbing process; a combined monitoring sketch follows the list:</p>



<ul class="wp-block-list">
<li><strong>CellCLI Metrics:</strong>
<ul class="wp-block-list">
<li>The most direct way to see real-time scrubbing activity is using the <code>LIST METRICCURRENT</code> command. Specifically, the <code>CD_IO_BY_R_SCRUB_SEC</code> metric shows the read I/O generated by scrubbing in MB/second for each cell disk (CD). Non-zero values indicate active scrubbing on that disk.  </li>



<li><em>Example Command:</em> <code>CellCLI> LIST METRICCURRENT WHERE name = 'CD_IO_BY_R_SCRUB_SEC'</code></li>



<li>Other related metrics (discoverable with <code>LIST METRICDEFINITION WHERE name like '%SCRUB%'</code>) may provide additional information about scrubbing wait times or resource usage. A sketch for polling this metric across all cells with <code>dcli</code> follows this list.</li>
</ul>
</li>



<li><strong>Cell Alert Logs:</strong>
<ul class="wp-block-list">
<li>Informational messages indicating the start (<code>Begin scrubbing celldisk</code>) and finish (<code>Finished scrubbing celldisk</code>) of scrubbing operations are logged in the cell alert logs. These logs can be examined using ADRCI (Automatic Diagnostic Repository Command Interpreter) or directly from files under the <code>$CELLTRACE</code> directory. Messages related to errors encountered during scrubbing or disk issues will also appear in these logs.  </li>



<li><em>Example Command:</em> <code>CellCLI> LIST ALERTHISTORY WHERE message LIKE '%scrubbing%'</code></li>
</ul>
</li>



<li><strong>AWR Reports (Automatic Workload Repository):</strong>
<ul class="wp-block-list">
<li>AWR reports, particularly in their Exadata-specific sections, provide aggregated information about scrubbing I/O activity that occurred during a specific snapshot period. Look for metrics labeled &#8216;scrub I/O&#8217; in the report.  </li>



<li>Seeing high &#8216;scrub I/O&#8217; in AWR during periods of low application I/O is normal and expected. However, understanding whether high scrub I/O correlates with performance degradation requires analyzing the overall system load, IORM configuration, and other sections in AWR like &#8216;Exadata OS I/O Stats&#8217;. AWR provides historical context for evaluating impact over time, while CellCLI metrics offer a real-time view.  </li>
</ul>
</li>



<li><strong>Real-Time Insight:</strong>
<ul class="wp-block-list">
<li>If configured, scrubbing metrics like <code>CD_IO_BY_R_SCRUB_SEC</code> can be sent to a preferred dashboard for visual monitoring of scrubbing activity across all Exadata cells.  </li>
</ul>
</li>



<li><strong>ASM Views:</strong>
<ul class="wp-block-list">
<li>While Exadata scrubbing doesn&#8217;t directly log to <code>V$ASM_OPERATION</code>, if scrubbing triggers an ASM repair or a subsequent rebalance, those operations <em>can</em> be monitored in <code>V$ASM_OPERATION</code>. The <code>V$ASM_DISK_STAT</code> view might also reflect I/O patterns related to scrubbing or repair.  </li>
</ul>
</li>
</ul>
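

<p>As referenced above, a minimal sketch for polling the scrub metric fleet-wide, assuming the usual <code>dcli</code> setup (a <code>cell_group</code> file listing the storage servers and passwordless SSH; the user and file name are illustrative):</p>



<pre class="wp-block-code"><code># One line per cell disk, per storage server; non-zero values mean scrubbing is active
dcli -g cell_group -l celladmin cellcli -e list metriccurrent CD_IO_BY_R_SCRUB_SEC</code></pre>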



<h3 class="wp-block-heading">C. Starting, Stopping, and Checking Status</h3>



<ul class="wp-block-list">
<li><strong>Starting:</strong> Scrubbing starts automatically based on the <code>hardDiskScrubInterval</code> and <code>hardDiskScrubStartTime</code> settings. The <code>hardDiskScrubStartTime=now</code> setting can be used to trigger the next cycle immediately. There isn&#8217;t a direct command like &#8220;start scrubbing now.&#8221;  </li>



<li><strong>Stopping:</strong> To stop and disable automatic scrubbing, set <code>ALTER CELL hardDiskScrubInterval=none</code>. This also stops any currently running scrubbing process.  </li>



<li><strong>Status Check:</strong> There is no single &#8220;scrubbing status&#8221; command. The status is inferred through the monitoring methods described above (CellCLI metrics, logs, AWR) by looking at active I/O rates and log messages; a small inference sketch follows this list.</li>
</ul>
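

<p>A small inference sketch combining the two signals mentioned above (the alert log path assumes the standard <code>$CELLTRACE</code> layout, which may vary by software version):</p>



<pre class="wp-block-code"><code># On a storage server: the most recent scrub begin/finish messages
grep -i "scrubbing celldisk" $CELLTRACE/alert.log | tail -4

# Cross-check with the real-time metric: non-zero means a cycle is running now
cellcli -e list metriccurrent CD_IO_BY_R_SCRUB_SEC</code></pre>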



<h3 class="wp-block-heading">D. Table 1: Essential CellCLI Commands for Exadata Hard Disk Scrubbing</h3>



<p>The following table summarizes the key CellCLI commands used to manage and monitor the Exadata hard disk scrubbing process:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Command</th><th>Purpose</th><th>Example</th><th>Sources</th></tr><tr><td>`ALTER CELL hardDiskScrubInterval = [daily\</td><td>weekly\</td><td>biweekly\</td><td></td></tr><tr><td>`ALTER CELL hardDiskScrubStartTime = [&#8216;&lt;timestamp&gt;&#8217;\</td><td>now]`</td><td>Sets the start time for the next scheduled scrubbing operation.</td><td><code>ALTER CELL hardDiskScrubStartTime=now</code></td></tr><tr><td><code>LIST CELL ATTRIBUTES hardDiskScrubInterval, hardDiskScrubStartTime</code></td><td>Displays the current scrubbing schedule configuration.</td><td><code>LIST CELL ATTRIBUTES hardDiskScrubInterval, hardDiskScrubStartTime</code></td><td></td></tr><tr><td><code>LIST METRICCURRENT WHERE name = 'CD_IO_BY_R_SCRUB_SEC'</code></td><td>Monitors the real-time scrubbing I/O rate for each hard disk.</td><td><code>LIST METRICCURRENT WHERE name = 'CD_IO_BY_R_SCRUB_SEC'</code></td><td></td></tr><tr><td><code>LIST ALERTHISTORY WHERE message LIKE '%scrubbing%'</code></td><td>Checks logs for scrubbing start/finish/error messages.</td><td><code>LIST ALERTHISTORY WHERE message LIKE '%scrubbing%'</code></td><td><sup></sup></td></tr></tbody></table></figure>



<h2 class="wp-block-heading">IV. Performance Impacts of the Scrubbing Process</h2>



<p>While designed to proactively protect data integrity, Exadata Automatic Hard Disk Scrubbing does have an impact on system resources, particularly the I/O subsystem. Understanding and managing this impact is crucial.</p>



<h3 class="wp-block-heading">A. Resource Consumption (CPU, I/O)</h3>



<p>The primary resource consumed by the scrubbing process is <strong>disk I/O</strong>. The operation involves reading sectors from the hard disks. On an otherwise idle system or disk, the scrubbing process can significantly increase disk utilization, potentially reaching close to 100% for the disk being scanned.</p>



<p><strong>CPU consumption</strong> on the storage server (cell) for the scrubbing check itself is generally low, as it&#8217;s largely an I/O-bound operation. However, if scrubbing detects an error and triggers a repair via ASM, that repair process (reading the good copy and writing it to the bad location) can consume additional resources (CPU and network) across cells and potentially database nodes, although the Exadata architecture aims to minimize this impact.</p>



<h3 class="wp-block-heading">B. Designed Operating Window (Low I/O Utilization)</h3>



<p>A key design principle to minimize the performance impact of Exadata scrubbing is that the process only runs when the storage server detects low average I/O utilization. This threshold is commonly cited as 25%.</p>



<p>The system automatically pauses or throttles scrubbing activity when I/O demand from the database workload exceeds this threshold. This mechanism aims to prevent scrubbing from significantly impacting production workloads.</p>



<p>However, there&#8217;s a nuance to the &#8220;25% utilization&#8221; threshold. It may not mean absolute idleness. There could be a persistent background I/O load running just below this threshold (e.g., 20-24%). Adding the scrubbing I/O on top of this existing load will increase the total I/O. While Exadata I/O Resource Management (IORM) prioritizes user I/O, even the minimal added load from scrubbing could potentially have a noticeable effect, especially for applications highly sensitive to very low latency. Therefore, while &#8220;low impact&#8221; is the goal, &#8220;zero impact&#8221; is not guaranteed.</p>



<h3 class="wp-block-heading">C. Interaction with I/O Resource Management (IORM)</h3>



<p>Exadata I/O Resource Management (IORM) plays a critical role in managing the performance impact of background tasks like scrubbing. IORM prioritizes and schedules I/O requests within the storage server based on configured resource plans.</p>



<p>IORM automatically prioritizes database workload I/O (e.g., user queries, OLTP transactions) over background I/O processes like scrubbing. This ensures minimal impact on application performance from scrubbing activity. IORM plans can be configured to manage resources among different databases or workloads, indirectly affecting the amount of resources available for background tasks like scrubbing.</p>
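

<p>To see how IORM is currently set up on a cell, a quick check of the plan and its objective might look like the sketch below (output fields abbreviated; the plan name is hypothetical):</p>



<pre class="wp-block-code"><code>CellCLI&gt; LIST IORMPLAN DETAIL
         name:           cell01_IORMPLAN
         objective:      auto
         status:         active

CellCLI&gt; ALTER IORMPLAN objective='auto'</code></pre>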



<h3 class="wp-block-heading">D. Potential Performance Impact and Mitigation Methods</h3>



<p>Despite being designed for low impact, scrubbing can cause spikes in disk utilization and may increase latency, particularly when the system is not completely idle even though the &#8220;idle&#8221; threshold is met. Concerns about performance impact, though often raised about general ASM scrubbing, can also apply to Exadata scrubbing.</p>



<p>To mitigate this potential impact, consider these strategies:</p>



<ul class="wp-block-list">
<li><strong>Scheduling:</strong> The most effective mitigation is to schedule the scrubbing process using <code>hardDiskScrubStartTime</code> and <code>hardDiskScrubInterval</code> during periods of genuinely low system activity (e.g., midnight, weekends); see the sketch after this list.  </li>



<li><strong>Monitoring:</strong> Regularly assess when scrubbing runs and its actual impact in your specific environment using AWR and CellCLI metrics.  </li>



<li><strong>IORM Settings:</strong> Ensure IORM is configured appropriately for your workload priorities.  </li>



<li><strong>Adaptive Scheduling:</strong> Leverage Exadata&#8217;s adaptive scheduling feature. This automatically adjusts the frequency based on need, potentially reducing unnecessary runs on healthy disks.  </li>
</ul>



<h3 class="wp-block-heading">E. Factors Affecting Scrubbing Duration</h3>



<p>The time required to complete a scrubbing cycle depends on several factors:</p>



<ul class="wp-block-list">
<li><strong>Disk Size and Type:</strong> Larger capacity hard disks naturally take longer to scan. Estimates like 8-12 hours for a 4TB disk, or 1-2 hours per terabyte when idle, have been mentioned. Modern High Capacity (HC) drives are much larger (18TB in X9M, 22TB in X10M), implying potentially much longer scrub times.  </li>



<li><strong>System Load:</strong> Since scrubbing pauses when user workload increases, the busier the system, the longer the <strong>total wall-clock time</strong> required to complete a scrub cycle. On a busy system, completing a cycle could take days.  </li>



<li><strong>Number of Errors Found:</strong> If many bad sectors are found, the time spent coordinating repairs with ASM can increase the total duration.  </li>



<li><strong>ASM Rebalance Activity:</strong> If scrubbing triggers a larger ASM rebalance operation, that separate process will consume its own resources and take time.  </li>



<li><strong>Configured Interval:</strong> While not affecting a single run&#8217;s duration, the interval determines how frequently the process starts.</li>
</ul>



<p>It&#8217;s noteworthy that duration estimates in the available documentation vary significantly. This highlights that estimates heavily depend on the Exadata generation (disk sizes and speeds), the software version (potential efficiency improvements), and, most importantly, the actual workload pattern and resulting &#8220;idle&#8221; time on the specific system. Relying on monitoring in your own environment is more accurate than general estimates. For instance, one observation noted a scrubbing rate of approximately 115MB/s per disk. At this rate, <em>continuously</em> scanning a 22TB disk (X10M) would take roughly 54 hours. Given that scrubbing runs intermittently based on load, the actual completion time could be considerably longer.</p>
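

<p>The arithmetic behind that figure is straightforward; a one-liner using the same assumptions (22 TB disk, ~115 MB/s sustained scrub reads) reproduces it:</p>



<pre class="wp-block-code"><code># 22 TB = 22,000,000 MB; divide by the per-disk rate and seconds per hour
awk 'BEGIN { printf "%.0f hours\n", 22 * 1000 * 1000 / 115 / 3600 }'
# prints "53 hours" of pure scan time, in line with the rough figure above</code></pre>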



<h2 class="wp-block-heading">V. Key Benefits of the Exadata Scrubbing Process</h2>



<p>Exadata Automatic Hard Disk Scrubbing is a valuable feature that significantly contributes to the data integrity and high availability capabilities of the Exadata platform.</p>



<h3 class="wp-block-heading">A. Proactive Detection of Latent Errors and Silent Data Corruption</h3>



<p>Its most fundamental benefit is the proactive discovery of physical media errors <em>before</em> they are encountered during normal database operations. This prevents &#8220;silent&#8221; data corruption, where errors occur on disk but remain undetected until the data is read (which could be much later). By checking data blocks that haven&#8217;t been accessed recently, it ensures such hidden threats are uncovered.</p>



<h3 class="wp-block-heading">B. Enhanced Data Integrity and Reliability</h3>



<p>By detecting physical errors and enabling their repair, the scrubbing process directly contributes to the overall data integrity and reliability of the Exadata platform. This feature complements other protection layers like Oracle HARD (Hardware Assisted Resilient Data) checks, ASM mirroring, and database-level checks, providing robust defense against data corruption.</p>



<h3 class="wp-block-heading">C. Automatic Repair Mechanism</h3>



<p>A significant advantage is that the feature automates not just detection but also the initiation of the repair process. In typical bad sector scenarios, both error detection <em>and</em> the triggering of repair via ASM happen automatically, requiring no manual intervention. This reduces administrative overhead and ensures timely correction of detected issues.</p>



<h3 class="wp-block-heading">D. Complements Other Exadata High Availability Features</h3>



<p>Scrubbing is part of Exadata&#8217;s comprehensive Maximum Availability Architecture (MAA) strategy. It works alongside features like redundant hardware components, Oracle RAC for instance continuity, ASM for storage virtualization and redundancy, HARD for I/O path validation, and potentially Data Guard for disaster recovery.</p>



<p>This reinforces Exadata&#8217;s &#8220;defense in depth&#8221; approach to data protection. HARD checks the I/O path during writes; database checks can verify logical structure; ASM provides redundant copies of data; and scrubbing proactively inspects the physical media at rest. No single feature covers all possible scenarios, but working together, they provide robust protection. Scrubbing forms a critical layer in this strategy, specifically targeting latent physical errors that might be missed by other mechanisms.</p>



<h2 class="wp-block-heading">VI. Evolution of the Scrubbing Feature Across Exadata Versions</h2>



<p>The Exadata Automatic Hard Disk Scrubbing feature has evolved along with the platform itself.</p>



<h3 class="wp-block-heading">A. Feature Introduction</h3>



<p>The Automatic Hard Disk Scrub and Repair feature was first introduced with <strong>Oracle Exadata System Software version 11.2.3.3.0</strong>. At that time, specific minimum database/Grid Infrastructure versions such as 11.2.0.4 or 12.1.0.2 were required for the feature to function.</p>



<h3 class="wp-block-heading">B. Adaptive Scrubbing Schedule</h3>



<p>A significant enhancement arrived with <strong>Exadata System Software version 12.1.2.3.0</strong>: the Adaptive Scrubbing Schedule. With this feature, if the scrubbing process finds a bad sector on a disk, the cell software automatically schedules the <em>next</em> scrubbing job for <em>that specific disk</em> to run more frequently (typically weekly). This temporarily overrides the cell-wide <code>hardDiskScrubInterval</code> setting for that disk. If the subsequent, more frequent run finds no errors, the disk&#8217;s schedule reverts to the global <code>hardDiskScrubInterval</code> setting. This feature also requires specific minimum Grid Infrastructure versions to operate.</p>



<p>This adaptive approach makes scrubbing more efficient. Instead of frequently scanning all disks, it focuses more attention only on disks showing potential issues. This conserves I/O resources on healthy disks while providing quicker follow-up checks on suspect ones.</p>



<h3 class="wp-block-heading">C. Other Related Developments (Post-12.1.2.3.0)</h3>



<p>Available documentation primarily focuses on the introduction of the scrubbing feature and the adaptive scheduling enhancement. Detailed information about significant changes to algorithms, performance tuning (beyond IORM interaction), or reporting in later versions (e.g., post-12.x, 18.x, 19.x, 20.x, 21.x, 22.x, 23.x) is not provided in the reviewed sources. Consulting the release notes for specific Exadata System Software versions might be necessary for details on newer developments.</p>



<h3 class="wp-block-heading">D. Table 2: Evolution of Key Exadata Scrubbing Features</h3>



<p>The following table summarizes the key milestones in the development of the Exadata scrubbing feature:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Exadata Software Version</th><th>Key Feature/Enhancement</th><th>Description</th><th>Sources</th></tr><tr><td>11.2.3.3.0</td><td>Automatic Hard Disk Scrub and Repair (Introduction)</td><td>Introduced the core feature for automatic, periodic inspection and initiation of repair via ASM.</td><td></td></tr><tr><td>12.1.2.3.0</td><td>Adaptive Scrubbing Schedule</td><td>Automatically increases scrubbing frequency (e.g., to weekly) for disks where bad sectors were recently detected.</td><td><sup></sup></td></tr><tr><td>Post-12.1.2.3.0</td><td>(Other Enhancements Unspecified)</td><td>(Specific major enhancements for later versions are not detailed in the provided documentation)</td><td></td></tr></tbody></table></figure>



<h2 class="wp-block-heading">VII. Configuration and Best Practices</h2>



<p>To maximize the benefits of the Exadata Automatic Hard Disk Scrubbing feature, proper configuration and adherence to Oracle&#8217;s Maximum Availability Architecture (MAA) principles are important.</p>



<h3 class="wp-block-heading">A. Default Settings and Configuration Options</h3>



<ul class="wp-block-list">
<li><strong>Default Schedule:</strong> By default, the scrubbing process is configured to run <strong>every two weeks (<code>biweekly</code>)</strong>.  </li>



<li><strong>Configuration Options:</strong> The <code>hardDiskScrubInterval</code> (<code>daily</code>, <code>weekly</code>, <code>biweekly</code>, <code>none</code>) and <code>hardDiskScrubStartTime</code> (<code>&lt;timestamp&gt;</code>, <code>now</code>) attributes can be set via CellCLI.  </li>



<li><strong>No Intensity/Priority Setting:</strong> There is no direct CellCLI setting to control the &#8220;intensity&#8221; or &#8220;priority&#8221; of the scrubbing process itself. Its impact is primarily managed by the idle-time logic and IORM.  </li>
</ul>



<h3 class="wp-block-heading">B. Recommended Scheduling Strategies for Production Environments</h3>



<ul class="wp-block-list">
<li><strong>Use Defaults:</strong> For many environments, the default bi-weekly schedule and the automatic execution during low I/O periods are sufficient.  </li>



<li><strong>Customize Start Time:</strong> Rather than relying solely on <code>now</code> or random times, explicitly setting <code>hardDiskScrubStartTime</code> to known low-load periods (e.g., 2 AM Sunday morning) offers a more controlled approach.  </li>



<li><strong>Assess Workload:</strong> On very busy, 24/7 systems, evaluate if the <code>biweekly</code> interval allows enough time for the process to complete. If not, consider <code>weekly</code>, but closely monitor the performance impact. Disabling scrubbing (<code>none</code>) is generally not recommended unless there&#8217;s a specific, temporary reason, as it forfeits the proactive detection benefit.  </li>



<li><strong>Align with Maintenance Windows:</strong> Coordinate scrubbing schedules with other planned maintenance windows if possible, although the automatic throttling mechanism should prevent major conflicts.  </li>



<li><strong>Monitor Completion:</strong> Check logs to ensure scrubbing cycles complete successfully within the planned interval. If cycles consistently fail to complete due to high load, the scheduling strategy needs review.  </li>
</ul>



<h3 class="wp-block-heading">C. Importance of ASM Redundancy</h3>



<ul class="wp-block-list">
<li><strong>High Redundancy Recommendation:</strong> Using <strong>High Redundancy (3 copies)</strong> for ASM disk groups on Exadata is strongly recommended, especially for production databases.  </li>



<li><strong>Rationale:</strong> While scrubbing works with Normal Redundancy (2 copies), High Redundancy provides significantly better protection against data loss during the repair window (especially if an unrelated second failure occurs). Scrubbing&#8217;s repair capability depends on having a healthy mirror copy available; a quick redundancy check follows this list.  </li>



<li><strong>Requirements:</strong> Properly implementing High Redundancy typically requires at least 5 failure groups (often 3 storage cells + 2 quorum disks on database servers for Quarter/Eighth Rack configurations).  </li>
</ul>
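

<p>The redundancy check referenced above is a one-liner against the ASM instance (standard <code>V$ASM_DISKGROUP</code> columns):</p>



<pre class="wp-block-code"><code>-- TYPE = HIGH (triple mirroring) is the recommendation for production disk groups
SELECT name, type, offline_disks
  FROM v$asm_diskgroup;</code></pre>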



<h3 class="wp-block-heading">D. Integration with Overall MAA Strategy</h3>



<p>Scrubbing is just one part of the MAA best practices recommended by Oracle for Exadata:</p>



<ul class="wp-block-list">
<li><strong>Regular Health Checks:</strong> Run the <code>exachk</code> utility regularly (e.g., monthly) or rely on AHF (Autonomous Health Framework) to run it automatically to validate configuration against best practices, including storage and ASM settings.  </li>



<li><strong>Use Standby Database:</strong> While Exadata scrubbing and HARD checks protect against many issues, a physical standby database (Data Guard) on a separate Exadata system is critical for comprehensive protection against site failures, certain logical corruptions, and as a secondary validation source.  </li>



<li><strong>Monitoring:</strong> Implement comprehensive monitoring (OEM, AWR, CellCLI metrics, logs, Real-Time Insight) to track system health, performance, and background activities like scrubbing.  </li>



<li><strong>Testing:</strong> Validate recovery procedures and understand the behavior of features like scrubbing and ASM rebalance in your test environment.  </li>
</ul>



<h3 class="wp-block-heading">E. Table 3: Exadata Scrubbing Configuration Attributes and Best Practices</h3>



<p>This table consolidates key configuration parameters and actionable recommendations:</p>



<figure class="wp-block-table"><table class="has-fixed-layout"><tbody><tr><th>Parameter/Area</th><th>Configuration/Setting</th><th>Default</th><th>Recommendation</th><th>Sources</th></tr><tr><td><code>hardDiskScrubInterval</code></td><td><code>daily</code>, <code>weekly</code>, <code>biweekly</code>, <code>none</code></td><td><code>biweekly</code></td><td>Start with default. Consider <code>weekly</code> for busy systems if needed, monitoring impact. Avoid <code>none</code>.</td><td></td></tr><tr><td><code>hardDiskScrubStartTime</code></td><td><code>&lt;timestamp&gt;</code>, <code>now</code></td><td>None</td><td>Explicitly set to a known low-load window (e.g., weekend night).</td><td></td></tr><tr><td>ASM Redundancy</td><td>Normal (2 copies), High (3 copies)</td><td>Normal</td><td>Use <strong>High Redundancy</strong> for production disk groups to maximize repair success probability.</td><td></td></tr><tr><td>Monitoring</td><td>CellCLI Metrics, Cell Logs, AWR, ASM Views, <code>exachk</code></td><td>None</td><td>Regularly monitor scrubbing activity, completion status, performance impact, and overall system health (<code>exachk</code>).</td><td><sup></sup></td></tr><tr><td>Scheduling Strategy</td><td>Workload-dependent</td><td>Idle-based</td><td>Schedule during predictably low-load times; ensure cycles complete.</td><td><sup></sup></td></tr><tr><td>MAA Integration</td><td>Part of overall HA</td><td>None</td><td>Integrate with Data Guard, regular health checks, and robust monitoring per MAA guidelines.</td><td><sup></sup></td></tr></tbody></table></figure>



<h2 class="wp-block-heading">VIII. Conclusion</h2>



<p>Oracle Exadata Automatic Hard Disk Scrub and Repair is a proactive defense mechanism crucial for maintaining data integrity and high availability on the Exadata platform. By periodically scanning hard disks on storage servers for physical errors, this feature detects latent corruptions, especially in infrequently accessed data, before they can impact applications.</p>



<p>The core strength of the scrubbing process lies in the integration between Exadata System Software and Oracle ASM. While the Cell Software detects the error, ASM manages the automatic repair process using mirrored copies. The effectiveness of this repair capability is directly tied to the correctly configured redundancy of ASM disk groups, particularly High Redundancy, which is strongly recommended for production environments.</p>



<p>From a performance perspective, the scrubbing process is designed to run during periods of low I/O utilization detected by the system and is managed by IORM. This aims to minimize the impact on production workloads. However, it remains important for administrators to monitor scrubbing activity via CellCLI metrics, alert logs, and AWR reports, and potentially adjust the schedule based on their environment&#8217;s specific workload patterns.</p>



<p>Introduced in Exadata 11.2.3.3.0 and enhanced with Adaptive Scheduling in 12.1.2.3.0, this feature is an integral part of Exadata&#8217;s multi-layered data protection strategy (including HARD checks, ASM mirroring, RAC, Data Guard, etc.). Properly configuring and operating Exadata Automatic Hard Disk Scrubbing is critical for preserving data integrity, preventing unexpected outages, and maximizing the value of the Exadata investment. For best results, scrubbing configuration and operation should be considered within the framework of Oracle MAA best practices, supported by regular system health checks (<code>exachk</code>) and comprehensive monitoring.</p>
<p>The post <a rel="nofollow" href="https://www.bugraparlayan.com.tr/comprehensive-guide-to-oracle-exadata-automatic-hard-disk-scrubbing.html">Comprehensive Guide to Oracle Exadata Automatic Hard Disk Scrubbing</a> appeared first on <a rel="nofollow" href="https://www.bugraparlayan.com.tr">Bugra Parlayan | Oracle Database &amp; Exadata Blog</a>.</p>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
