Basic Procedure to Manage Downtime

Overview


Altibase is a provider of database solutions. For mission-critical systems, downtime often has serious negative repercussions. Even in the presence of a robust disaster recovery plan, a series of unforeseen events may result in database downtime.

This guide describes the most common methods for troubleshooting Altibase when downtime occurs for any reason. To ensure a rapid and effective response to downtime events, database administrators should be thoroughly familiar with the contents of this document.

This guide does not cover abnormal increases in CPU utilization or delays caused by locks. For those topics, and for any other information not contained in this document, refer to the Performance Tuning Guide or the Administrator’s Manual.

Classification of Downtime Events


Definition of Downtime

Altibase classifies downtime events into two distinct groups: emergency and non-emergency. The distinction between the two classifications is outlined in the table below:

Classification | Description
Emergency Downtime | The database is rendered completely inoperable due to a system error, database error, or other catastrophic event
Non-emergency Downtime | The database is operable but is experiencing issues or behaving abnormally

For emergency downtime events, Altibase provides immediate support for any customers with active maintenance agreements.

In contrast, for non-emergency downtime events, Altibase will redirect the user to relevant guides or provide information that allows the customer to rectify the error. If the customer is still unable to resolve the error, Altibase will provide direct support to any customers with active maintenance agreements.

Emergency Downtime

Emergency downtime is classified into two distinct groups:

Type | Definition
System Abnormality | Issues caused by the server hardware, operating system, or unrelated software
Database Abnormality | Issues directly attributable to Altibase

An emergency downtime event is classified as any event that renders the database completely inoperable. In any such situation, the user should immediately provide Altibase with the following information:

Information | Location
System logs | Refer to section “System-related Downtime”
Database trace logs | $ALTIBASE_HOME/trc
Supplemental information | Any information regarding database activities, system processes, or jobs that were being executed prior to the emergency downtime event

It is important that this information be sent along with the notification of the emergency downtime event to Altibase technical support. This will minimize the need for information requests and help expedite the resolution of the issue. Failure to send relevant information may prolong the amount of time it takes Altibase to provide a viable solution.

Troubleshooting Emergency Downtime Events

Step 1: Verify that the Altibase process is running.

Shell> ps -ef | grep "altibase -p boot from" | grep -v grep

If Altibase is operational, an Altibase process should be identified by this command.

If an Altibase process is identified, the following command will verify that Altibase is currently accepting connections.

Shell> isql -u [db user id] -p [db user password] -s [db IP address] -port [port_no]

Step 2: If no Altibase process is found, the following actions should be taken.

  • Send an email to support@altibase.com and attach all relevant logs. Refer to the emergency downtime section for a list of the requisite logs.
  • Call the Altibase Solution Center (US: +1-888-837-7333, Korea: +822-2082-1114).

Step 3: Restart Altibase by executing the following commands from the user account that was used to install Altibase.

Shell> server kill

Shell> server start

The first command ensures that no Altibase instance is still running; the second starts the Altibase instance. Both commands can only be executed by the user that installed Altibase.
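The process check from Step 1 can also be scripted. The following is a minimal sketch rather than an Altibase-supplied tool: the function name `count_altibase_procs` is hypothetical, and it simply counts the lines of `ps -ef` output (read from standard input) that match the pattern used in Step 1.

```shell
#!/bin/sh
# Hypothetical helper: count Altibase server processes in `ps -ef` output
# read from standard input. The match pattern is the one used in Step 1.
count_altibase_procs() {
    grep "altibase -p boot from" | grep -v grep | wc -l | tr -d ' '
}

# Example usage (not executed here):
#   if [ "$(ps -ef | count_altibase_procs)" -eq 0 ]; then
#       echo "No Altibase process found; proceed to Step 2."
#   fi
```

A count of zero corresponds to the situation described in Step 2. A non-zero count only shows that the process exists, so the isql connection test is still needed to confirm that the server is accepting connections.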

Troubleshooting Specific Downtime Events


Connection Errors

This section describes five types of database connection errors that may occur even if the Altibase instance itself is operational.

Type | Description
User Account Restrictions | This error occurs when a connection requires more file descriptors than the number defined in the property file for the user. The following errors are recorded in the altibase_boot.log file:
    ERR-01052(errno=24) Unable to invoke open() function on [~~~]
    ERR-71016(errno=24) Failed to invoke a system function, accept() Dispatcher failed callback
Altibase Hangs | This error occurs in any situation where the system or Altibase itself is unresponsive (e.g. unable to connect to the database, no response, or connection error).
Invalid Connection Parameters | This error occurs when an incorrect IP address, port, username, or password is used.
    ERR-50032 : Client unable to establish connection.
    ERR-31010 : User not found
    ERR-4102E : Invalid password
Network Error | This error occurs if there is any network issue (e.g. network card failure).
Insufficient Disk Space | This error occurs if there is insufficient disk space available.
    ERR-01052(errno=24) Unable to invoke open() function on [~~~]
    ERR-01052(errno=24) Unable to invoke write() function on [~~~]

The errors above are typically caused by user or hardware issues. Suggested solutions are provided in the table below.

Type | Checklist
User Account Restrictions | Check for any file descriptor limits using the command ulimit -n. Increase the value if necessary and restart Altibase. The recommended file descriptor value is unlimited; at minimum it should be set to 4096 or greater.
Invalid Connection Parameters | Check the connection settings. Make sure the correct username, password, IP address, and port have been provided.
Network Error | Check for packet errors using the command netstat. Also check for device malfunctions by connecting to the server where Altibase is installed using ftp or telnet, and check for abnormal slowdowns in packet transmission speed.
Insufficient Disk Space | Check disk space using commands such as df (bdf) and take measures to acquire sufficient disk space. Warning: Take special care not to accidentally delete Altibase’s redo logs. Deleting redo logs without backups may render database recovery impossible.
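The file descriptor check in the first row can be expressed as a small test. This is a sketch under the assumption that the value comes from `ulimit -n` in the shell that starts Altibase; the function name `fd_limit_ok` is hypothetical, and the 4096 minimum is the figure given in the checklist.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when a `ulimit -n` value meets the
# minimum of 4096 given in the checklist; "unlimited" always passes.
fd_limit_ok() {
    limit=$1
    [ "$limit" = "unlimited" ] && return 0
    # A non-numeric value makes the test fail, which is treated as "not ok".
    [ "$limit" -ge 4096 ] 2>/dev/null
}

# Example usage (not executed here):
#   fd_limit_ok "$(ulimit -n)" || echo "Raise the file descriptor limit and restart Altibase."
```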

If the Altibase process exists but connection to the database instance is not possible, a database hang is the most likely culprit. If it is determined that Altibase is unresponsive, collect the information outlined in the table below and contact Altibase for technical support.

Obtaining Hang Information by Operating System

SUN
/usr/sbin/pstack -F process_id > 1.txt
/usr/sbin/pstack -F process_id > 2.txt
/usr/sbin/pstack -F process_id > 3.txt
Execute these commands in sequence, 30 seconds apart.

HP
Itanium-based (IA-64) systems support the following commands, but PA-RISC systems do not.
/usr/ccs/bin/pstack process_id > 1.txt
/usr/ccs/bin/pstack process_id > 2.txt
/usr/ccs/bin/pstack process_id > 3.txt
Execute these commands in sequence, 30 seconds apart.

AIX
/usr/bin/procstack -F process_id > 1.txt
/usr/bin/procstack -F process_id > 2.txt
/usr/bin/procstack -F process_id > 3.txt
Execute these commands in sequence, 30 seconds apart.

LINUX
Some systems with older kernel versions may not support the following commands.
/usr/bin/pstack process_id > 1.txt
/usr/bin/pstack process_id > 2.txt
/usr/bin/pstack process_id > 3.txt
Execute these commands in sequence, 30 seconds apart.

Commands similar to pstack show the status of all process threads in detail. Such information is extremely useful in resolving database issues with an indeterminate root cause. Any such information should be forwarded to Altibase along with system logs and trace logs.
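The three-dump sequence can be scripted so that the 30-second spacing is kept consistent. The sketch below is illustrative only: the function name `capture_stacks` and its argument order are hypothetical, and the stack command must be supplied for the platform in question (for example `/usr/sbin/pstack -F` or `/usr/bin/procstack -F` from the table above).

```shell
#!/bin/sh
# Hypothetical helper: run a stack-dump command three times against one
# process, writing 1.txt, 2.txt and 3.txt into the current directory.
#   $1 = process ID
#   $2 = stack command for this platform, e.g. "/usr/bin/pstack"
#   $3 = seconds to wait between dumps (defaults to 30)
capture_stacks() {
    pid=$1
    stack_cmd=$2
    interval=${3:-30}
    for n in 1 2 3; do
        # $stack_cmd is left unquoted on purpose so that options such as
        # "-F" are passed as separate arguments.
        $stack_cmd "$pid" > "$n.txt"
        [ "$n" -lt 3 ] && sleep "$interval"
    done
    return 0
}

# Example usage (not executed here):
#   capture_stacks 12345 "/usr/sbin/pstack -F"
```

The three output files can then be attached to the support request together with the system logs and trace logs.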

Insufficient Resource Error

Resources include both physical memory/disk space and the logical space utilized by Altibase. This section will describe various troubleshooting methods for any errors related to resources.

  • Insufficient Tablespace
Type | Diagnosis Method
Insufficient Memory Tablespace Size | The following error occurs if the memory tablespace is exhausted:
    [ERR-110F1 : Unable to extend the tablespace(XXXXX) because the current size of tablespace(4194304K) becomes larger than MAXSIZE(4194304K) of the tablespace.]
Insufficient Disk Tablespace Size | The following error occurs if the disk tablespace is exhausted:
    [ERR-11123 : The tablespace does not have enough free space ( TBS Name :XXXXX ).]

As a hybrid database, Altibase supports both in-memory and on-disk tables. As a result, both of these resources must be monitored. If a tablespace is exhausted, the following actions must be taken.

Type | Solution

Insufficient In-Memory Tablespace Size: user memory tablespace
iSQL> ALTER TABLESPACE xxxxx ALTER AUTOEXTEND OFF;
iSQL> ALTER TABLESPACE xxxxx ALTER AUTOEXTEND ON MAXSIZE 1G;
Execute these commands in sequence. The new MAXSIZE value must be larger than the current MAXSIZE value.

Insufficient In-Memory Tablespace Size: SYS_TBS_MEM_DATA/SYS_TBS_MEM_DIC tablespace, or when the ALTER TABLESPACE command did not resolve the problem
The ALTER TABLESPACE command above cannot be used with the SYS_TBS_MEM_DATA or SYS_TBS_MEM_DIC tablespaces. It also cannot be used if the new size would exceed the value of the MEM_MAX_DB_SIZE property.
In these cases, two options are available: remove outdated data from the table and then execute the compact command, or increase the value of the MEM_MAX_DB_SIZE property.
iSQL> delete from [table name];
iSQL> truncate table [table name];
iSQL> alter table [table name] compact;

Insufficient On-Disk Tablespace Size
iSQL> ALTER TABLESPACE xxxxx ADD DATAFILE 'abcd.dbf' SIZE 1G AUTOEXTEND OFF;
Replace the tablespace name, datafile name, and datafile size with the desired values.

An insufficient space error may indicate a critical issue, so its cause must be investigated promptly. A gradual increase in disk utilization may be perfectly normal, but a sudden spike may signify an application or database issue. Check the size of database objects to see whether any of them changed suddenly.

  • Insufficient Physical Disk Space 

If physical disk space is insufficient, problems will occur when redo logs can no longer be stored. In this situation, the database may appear to hang. The same problem will occur with Altibase’s trace logs. This has the additional consequence of making it difficult to determine the cause of the issue, as errors will no longer be written to the trace logs. In this situation, the only viable solution is to make additional disk space available.

  • Insufficient Physical Memory Space 

If physical memory space is insufficient, no solution can be performed while the database is providing service. If possible, the following command should be executed to obtain any information regarding the current status of Altibase.

iSQL> SELECT * FROM v$memstat ORDER BY max_total_size;

This command displays information about the resource utilization of Altibase’s internal in-memory modules. If the results are logged periodically, successive snapshots can be compared to identify the modules whose resource utilization changes the most.
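Comparing two such snapshots can be automated. The sketch below rests on an assumption that is not part of this guide: each snapshot has been saved as a text file with two whitespace-separated columns, a module name and its max_total_size value. The function name `memstat_growth` is hypothetical; it prints every module whose value grew between the two files, together with the increase.

```shell
#!/bin/sh
# Hypothetical helper: compare two saved v$memstat snapshots.
# Assumed file format (one module per line): <module_name> <max_total_size>
#   $1 = earlier snapshot file
#   $2 = later snapshot file
# Prints "<module_name> <increase>" for each module that grew.
memstat_growth() {
    awk 'NR == FNR { before[$1] = $2; next }
         ($1 in before) && (($2 + 0) > (before[$1] + 0)) {
             print $1, $2 - before[$1]
         }' "$1" "$2"
}

# Example usage (not executed here):
#   memstat_growth memstat_0900.txt memstat_1000.txt
```

Modules that appear only in the later snapshot are ignored by this sketch, and an empty earlier file is not handled; both cases would need extra care in a real script.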

  • Long Running Transaction Performance Issues

Altibase supports MVCC (Multi-Version Concurrency Control), a concurrency control technique that prevents SELECT and UPDATE operations from waiting on each other.

However, MVCC techniques create garbage data that must be removed. Unfortunately, this data cannot be removed until all the transactions in a large UPDATE or SELECT job are processed. In this situation, the amount of redo log files or memory utilization may increase dramatically. The following query can be used to identify any query that has been executing for a significant period of time.

iSQL> SELECT * FROM v$statement WHERE total_time > 100000000 and execute_flag = 1;

This command finds any query that has been executing for longer than 100 seconds.

iSQL> select session_id, id, rpad(query, 150) from v$statement where tx_id = (select id from v$transaction where memory_view_scn in (select MINMEMSCNINTXS from v$memgc limit 1));

This command finds the long-running query that is hindering garbage collection.

System Error

The table below outlines the errors caused by insufficient system resources.

Error Type | Description
Out of memory | Insufficient memory space
Resource busy | System resources temporarily inaccessible
Too many open files | Limit of simultaneously open files exceeded
No space left on device | Insufficient physical disk space

If Altibase’s trace logs reflect any of the aforementioned errors, it is likely that associated system errors were recorded to system logs. Using these system error codes, it is possible to identify which system resource caused the error.
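A quick way to see which of these errors appear in a trace log is to count the matching lines. The following is a rough sketch: the function name `count_system_errors` is hypothetical, and the search strings are copied verbatim from the table above, so the exact wording written by a given platform may differ.

```shell
#!/bin/sh
# Hypothetical helper: read a trace log from standard input and print how
# many lines contain each of the four system error messages listed above.
count_system_errors() {
    log=$(cat)
    for msg in "Out of memory" "Resource busy" \
               "Too many open files" "No space left on device"; do
        # grep -c prints 0 when nothing matches; -F treats msg as plain text.
        n=$(printf '%s\n' "$log" | grep -c -F "$msg")
        echo "$msg: $n"
    done
}

# Example usage (not executed here):
#   count_system_errors < "$ALTIBASE_HOME/trc/altibase_boot.log"
```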

The following system logs can be used to identify system errors:

Operating System | System Log
SUN | /var/adm/messages
HP | /var/adm/syslog/syslog.log
AIX | errpt -a
LINUX | /var/log/messages
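When the same collection script is used on several platforms, the log location can be selected from the operating system name. This is a minimal sketch: the function name `system_log_source` is hypothetical, the keys are the values that `uname -s` reports on these systems, and the conventional file names /var/adm/messages and /var/log/messages are used for SUN and LINUX. Some Linux distributions write the system log elsewhere, which this sketch does not handle.

```shell
#!/bin/sh
# Hypothetical helper: map an operating system name (as printed by
# `uname -s`) to the system log file or command for that platform.
system_log_source() {
    case "$1" in
        SunOS) echo "/var/adm/messages" ;;
        HP-UX) echo "/var/adm/syslog/syslog.log" ;;
        AIX)   echo "errpt -a" ;;
        Linux) echo "/var/log/messages" ;;
        *)     echo "unknown"; return 1 ;;
    esac
}

# Example usage (not executed here):
#   system_log_source "$(uname -s)"
```

Note that the AIX entry is a command to run rather than a file to read.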

Replication Errors

Altibase supports replication using TCP/IP for the purposes of high availability. This section explains how to troubleshoot common replication errors.

Type | Description
Replication Transmission Error | When either the sender or the receiver is not operational due to network or configuration issues
Data Conflict Error | When data cannot be replicated due to data inconsistencies between databases

If a replication issue occurs in either the sender or the receiver, the following commands should be executed to verify that they are operational:

Classification | Verification Method
Sender | iSQL> SELECT count(*) FROM v$repsender;
Receiver | iSQL> SELECT count(*) FROM v$repreceiver;

If the sender and receiver are correctly configured in active-active mode, the returned value should be at least 1 and should be identical to the number of replication objects.

The altibase_rp.log trace log records information relating to replication.

Message recorded during normal operation

[Recovery Sender] Replication REP1 Start… at [6030857] (Server log from replication started)

[Receiver] Replication REP1 Started … (Server log of receiving order for replication)

Message recorded if an error occurs when connecting to the sender (network issue or receiver issue)

ERR-61012(errno=111) [Sender] Failed to connect to the peer server

Message recorded if receiver is terminated

ERR-6104b(errno=0) [Receiver] REP1 receiver is ended (by thr_exit)

Message recorded if replication was terminated correctly by the receiver

RECEIVER:REPLICATION STOP MSG arrived!

This trace log should be analyzed to determine if any replication transmission errors were caused by explicit user commands, or if they were caused by temporary network issues. If the replication restart command has no effect and replication cannot be restarted, contact Altibase technical support for further assistance.

A problem that may occur during replication is data inconsistency caused by unsent transactions. A worst case scenario is when a replication issue occurs in the sender because of a dramatic increase in redo log files caused by unsent replication logs. Therefore, the replication should be monitored periodically to ensure that this issue does not occur.

iSQL> SELECT rep_name, rep_gap FROM v$repgap;

REP_NAME                       REP_GAP
--------------------------------------
REP1                           0
1 row selected.

The REP_NAME column represents the name of the replication object, while the REP_GAP column denotes the number of logs that must be sent. The REP_GAP value will change continuously as transactions are executed, but ideally the value should be close to zero. If the value is increasing consistently, this may signify a replication issue or an excessive number of executing transactions. Therefore, each server’s replication status and network should be monitored.
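Whether the gap is "increasing consistently" can be checked mechanically once a few REP_GAP readings have been collected. The sketch below is one possible interpretation, not an Altibase-defined rule: the hypothetical function `gap_is_increasing` succeeds only if every reading is strictly larger than the one before it.

```shell
#!/bin/sh
# Hypothetical helper: succeed (exit 0) when the REP_GAP samples passed as
# arguments, oldest first, are strictly increasing. Fewer than two samples
# cannot show a trend, so that case fails.
gap_is_increasing() {
    [ "$#" -ge 2 ] || return 1
    prev=$1
    shift
    for cur in "$@"; do
        [ "$cur" -gt "$prev" ] || return 1
        prev=$cur
    done
    return 0
}

# Example usage (not executed here), with four readings taken over time:
#   gap_is_increasing 0 5 12 40 && echo "REP_GAP is growing; check replication."
```

A stricter or looser rule (for example, tolerating one dip) may suit a busy system better, since the value fluctuates as transactions are executed.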

The altibase_rp.log trace file contains any information regarding data inconsistency errors related to replication.

PK conflict for INSERT operation (Duplicate data)

ERR-11058(errno=0) The row already exists in a unique index.

No corresponding data for DELETE operation (Not found)

ERR-61036(errno=0) [Receiver] err_not found in deleteXlog()

ERR-61000(errno=0) The received record is not found in the database.

No corresponding data for UPDATE operation (Not found)

ERR-6103a(errno=0) [Receiver] err_not_found in updateXlog()

ERR-61000(errno=0) The received record is not found in the database.

The original value of the target record differs from the original value in the replicated UPDATE operation (this problem can occur when active-active replication is used)

ERR-61035(errno=0) [Receiver] An update conflict encountered.

ERR-61001(errno=0) A conflict has been occurred while executing the received statement.

Each type of data conflict will also return the SQL statement that caused the conflict. This information is critical to finding the source of the problem.

The root cause is usually that rows with the same PK held different values on the two servers from the outset, or that the same row was changed on both servers at the same time. Therefore, the usage pattern of the application must be examined closely, starting from the SQL statement that caused the conflict.

Miscellaneous


Technical Support Structure

Altibase provides 24/7 technical support for all customers who have active maintenance support agreements in place. The support structure is outlined in the diagram below:

[Figure 1: Technical support structure]

Altibase also provides online technical support through the altibase.com portal, which offers a wealth of information regarding Altibase’s products and services. Users can also use the support portal for patches, technical documents, and Q&A.

Altibase’s Trace Logs

Altibase’s trace logs are located in the $ALTIBASE_HOME/trc directory. The table below outlines the trace files available in the aforementioned directory:

Main Trace Log File | Description
altibase_boot.log | Contains information about the database’s status
altibase_sm.log | Contains information about checkpoints and tablespaces
altibase_rp.log | Contains information about replication-related events
altibase_qp.log | Contains information about user queries

Basic Monitoring

This section summarizes the basic elements of an Altibase implementation that should be monitored.

The “Shell” and “iSQL” statements denote whether the command is to be executed against the operating system or in the iSQL utility.

Monitoring Item: Verify Altibase’s server process
Method or Command: Shell> ps -ef | grep "altibase -p boot from" | grep -v grep
Expected Results: Should return at least one result

Monitoring Item: Verify available memory space
Method or Command: Use vmstat or a similar utility to review current memory utilization
Expected Results: Available memory should be at least 20% of total memory

Monitoring Item: Altibase’s memory utilization
Method or Command: iSQL> select sum(max_total_size) from v$memstat;
Expected Results: To monitor for abnormal, sudden increases in usage

Monitoring Item: Altibase’s memory database allocation
Method or Command: iSQL> select trunc ( (mem_alloc_page_count*32*1024) / mem_max_db_size * 100.0, 2) from v$database;
Expected Results: To ensure that memory database allocation is no more than 90% of the maximum memory database size (mem_max_db_size)

Monitoring Item: System’s physical disk utilization
Method or Command: Shell> df -k
Expected Results: To ensure that sufficient disk space is available and that there are no abnormal spikes in disk utilization

Monitoring Item: Altibase’s disk database allocation
Method or Command:
iSQL> select a.name, a.ALLOCATED_PAGE_COUNT, sum(b.maxsize)
      from v$tablespaces a, v$datafiles b
      where a.id = b.spaceid
      group by a.name, a.ALLOCATED_PAGE_COUNT;
Expected Results: To monitor the currently allocated space and determine whether physical datafiles must be added

Monitoring Item: Altibase’s trace logs
Method or Command: Check messages starting with “ERR-”
Shell> tail -f altibase_boot.log | grep "ERR-"
Expected Results: To troubleshoot any errors contained in the trace logs

Monitoring Item: Altibase’s trace logs (checkpoints)
Method or Command: Verify that the following message is periodically posted to altibase_sm.log:
Remove Online Log File at LFG [0]: File[11252 ~ 11253]
Expected Results: To ensure that checkpoints are being processed and that log files are being deleted. Investigate further if the file number part is consistently displayed as “None”.

Monitoring Item: Altibase’s replication status
Method or Command: iSQL> select rep_name, rep_gap from v$repgap;
Expected Results: To ensure that the replication gap is not steadily increasing
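The disk utilization check lends itself to a scheduled script. The following is a sketch under stated assumptions: it expects the POSIX output format produced by `df -P -k` (capacity in the fifth column, mount point in the sixth), the function name `disks_over_threshold` is hypothetical, and the threshold is chosen by the administrator because this guide does not prescribe one.

```shell
#!/bin/sh
# Hypothetical helper: read `df -P -k` output from standard input and print
# the mount point and capacity of every file system at or above a threshold.
#   $1 = threshold as a whole-number percentage, e.g. 90
# Mount points that contain spaces are not handled by this sketch.
disks_over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        pct = $5
        sub(/%/, "", pct)          # strip the trailing percent sign
        if (pct + 0 >= limit + 0) print $6, pct "%"
    }'
}

# Example usage (not executed here):
#   df -P -k | disks_over_threshold 90
```

Any file system reported here should be reviewed before the redo logs or trace logs run out of space.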

Copyright ⓒ 2000~2016 Altibase Corporation. All Rights Reserved.

These documents are for informational purposes only. The information contained herein is not warranted to be error-free and is subject to change without notice. Decisions pertaining to Altibase’s product characteristics, features and development roadmap are at the sole discretion of Altibase. Altibase may own related patents, trademarks, copyright or other intellectual property rights of products and/or features discussed in this document.