<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>OraInternals</title>
	<atom:link href="http://www.orainternals.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.orainternals.com</link>
	<description>A company specializing in RAC, Performance, and E-business suite</description>
	<lastBuildDate>Sun, 29 Apr 2012 17:27:50 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3</generator>
		<item>
		<title>_gc_fusion_compression</title>
		<link>http://www.orainternals.com/2012/04/29/_gc_fusion_compression/</link>
		<comments>http://www.orainternals.com/2012/04/29/_gc_fusion_compression/#comments</comments>
		<pubDate>Sun, 29 Apr 2012 17:23:36 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[11g]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Oracle database internals]]></category>
		<category><![CDATA[Performance tuning]]></category>
		<category><![CDATA[RAC]]></category>
		<category><![CDATA[cache fusion internals]]></category>
		<category><![CDATA[_gc_fusion_compression]]></category>

		<guid isPermaLink="false">http://www.orainternals.com/?p=1389</guid>
		<description><![CDATA[We know that database blocks are transferred between nodes through the interconnect, a.k.a. cache fusion traffic. A common misconception is that the packet size for a block transfer is always the database block size (messages, of course, are smaller). That&#8217;s not entirely true. There is an optimization in the cache fusion code that reduces the [...]]]></description>
			<content:encoded><![CDATA[<p>
 We know that database blocks are transferred between nodes through the interconnect, a.k.a. cache fusion traffic. A common misconception is that the packet size for a block transfer is <em>always</em> the database block size (messages, of course, are smaller). That&#8217;s not entirely true. There is an optimization in the cache fusion code that reduces the packet size (and so reduces the bits transferred over the private network). Don&#8217;t confuse this note with jumbo frames and MTU size; this note is independent of the MTU setting.
</p>
<p>
In a nutshell, if the free space in a block exceeds a threshold (_gc_fusion_compression), then instead of sending the whole block, LMS sends a smaller packet, reducing the number of bits sent over the private network. Let me give an example to illustrate my point. Say the database block size is 8192 bytes and the block to be transferred is a recently NEWed block with 4000 bytes of free space. Transferring this block over the interconnect from one node to another results in a packet of ~4200 bytes. The bytes representing free space need not be sent at all: the free space begin offset and free space end offset are enough to reconstruct the block on the receiving side without any loss of data. This optimization makes sense, as there is no need to clog the network unnecessarily.
</p>
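<p>The idea can be illustrated with a toy model. The Python sketch below is not Oracle's wire format: it assumes a block is a flat byte array in which the bytes between the free space begin offset (fsbo) and the free space end offset (fseo) are free space whose contents do not matter. The helpers pack_block and unpack_block are hypothetical names used only for illustration.</p>

```python
# Toy model of free-space elision for a block transfer.
# Hypothetical: this is NOT Oracle's actual cache fusion packet format.

def pack_block(block: bytes, fsbo: int, fseo: int) -> tuple:
    """Sender side: keep the two offsets and every byte except the free-space gap."""
    return fsbo, fseo, block[:fsbo] + block[fseo:]


def unpack_block(fsbo: int, fseo: int, payload: bytes) -> bytes:
    """Receiver side: rebuild the full block, zero-filling the free-space gap."""
    gap = bytes(fseo - fsbo)
    return payload[:fsbo] + gap + payload[fsbo:]


# Example from the text: an 8192-byte block with 4000 bytes of free space.
# The 100-byte header and the 'R' row bytes are made-up filler.
header = b"H" * 100
free_space = bytes(4000)
rows = b"R" * (8192 - 100 - 4000)
block = header + free_space + rows

fsbo, fseo, payload = pack_block(block, 100, 100 + 4000)
print(len(block), len(payload))                    # 8192 4192
print(unpack_block(fsbo, fseo, payload) == block)  # True
```

<p>The payload is 4192 bytes instead of 8192, close to the ~4200-byte packet mentioned above, and the receiver gets back an identical block.</p>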
<p><span id="more-1389"></span></p>
<p>
Remember that this is not compression in the traditional sense; rather, it avoids sending unnecessary bytes.
</p>
<p>
 The parameter _gc_fusion_compression determines the threshold and defaults to 1024 in 11.2.0.3. So, if the free space in the block exceeds 1024 bytes, the block is a candidate for the reduced packet size.
</p>
<p><b> Test cases and dumps </b></p>
<p>
From the test cases, I see that three fields in the block can be used to determine the free space available in the block. If you dump a block using the &#8216;alter system dump datafile..&#8217; syntax, you will see the following three fields:
</p>
<pre>
fsbo=0x26
fseo=0x1b6a
avsp=0x1b44
</pre>
<p>
fsbo stands for Free Space Begin Offset, fseo stands for Free Space End Offset, and avsp stands for AVailable free SPace.
</p>
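<p>The dump prints these values in hexadecimal. A small, hypothetical Python helper converts them to bytes. For this particular block, fseo minus fsbo equals avsp (6980 bytes), so all of the available space sits in the single gap between the two offsets.</p>

```python
import re

# The three free-space fields exactly as they appear in the block dump above.
DUMP = """
fsbo=0x26
fseo=0x1b6a
avsp=0x1b44
"""


def parse_free_space(dump: str) -> dict:
    """Return {field: value_in_bytes} for the fsbo, fseo and avsp lines of a dump."""
    return {name: int(value, 16)
            for name, value in re.findall(r"(fsbo|fseo|avsp)=0x([0-9a-f]+)", dump)}


fields = parse_free_space(DUMP)
print(fields)                           # {'fsbo': 38, 'fseo': 7018, 'avsp': 6980}
print(fields["fseo"] - fields["fsbo"])  # 6980
```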
<p>
It <i>seems</i> to me from the test cases that the LMS process looks up these fields and constructs the buffer depending on the value of the avsp field. If avsp exceeds 1024, then the buffer is smaller than 8K (smaller than 7K, for that matter). The following paragraphs explain my test results.
</p>
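<p>Expressed as code, the rule inferred from these tests looks like the hypothetical Python predicate below. It is my reading of the observed behavior, not Oracle source code, and the strict "greater than" comparison is an assumption.</p>

```python
DEFAULT_THRESHOLD = 1024  # _gc_fusion_compression default in 11.2.0.3, per this note


def ships_reduced_block(avsp: int, threshold: int = DEFAULT_THRESHOLD) -> bool:
    """Inferred rule: if available free space exceeds the threshold, free space is not sent."""
    return avsp > threshold


print(ships_reduced_block(0x1b44))        # True:  6980 bytes free (test case #2)
print(ships_reduced_block(0x36c))         # False: 876 bytes free, whole block sent
print(ships_reduced_block(0x1b44, 8192))  # False: free space never exceeds 8192
```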
<p>
Initially, I had just one row (row length = 105 bytes), and the Wireshark packet analysis shows that one 8K block transfer resulted in a 690-byte packet. That is, the network packet was just 690 bytes for an 8192-byte block transfer: a massive reduction in GC traffic.
</p>
<p>
In test case #2, with 10 rows in the block, the transferred packet was 1680 bytes. The block dump shows avsp=0x1b44 (6980 bytes), leaving just 1212 bytes of useful information. The cache fusion code avoided sending those 6980 bytes and reduced the transferred packet size to just 1680 bytes.
</p>
<p>
In test case #3, with 50 rows in the block, the transferred packet was 5776 bytes; the free space in the block was 2620 bytes.
</p>
<p>
This behavior continued until the free space was just above 1024 bytes. When the free space fell below 1024 (I accidentally added more rows, so the free space dropped to ~900 bytes), the whole block was transferred and the packet size was 8336 bytes.
</p>
<pre>
fsbo=0x96
fseo=0x402
avsp=0x36c
</pre>
<p>
  These test cases demonstrate that the cache fusion code optimizes the packet transfer by eliminating the bytes representing free space.
</p>
<p><b> More test cases </b></p>
<p> So, what happens if you delete rows in the block? Remember that rows are not physically deleted, just tagged with a D flag in the row directory, so the free space information remains the same. Even if you delete 90% of the rows in the block, the avsp field is not updated until block defragmentation happens. This means that deleting rows alone still results in a whole-block transfer, until the block is defragmented.
</p>
<pre>
# After deletion of nearly all rows in the block.
fsbo=0x96
fseo=0x402
avsp=0x36c
</pre>
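<p>A toy Python model of this behavior, under loudly stated assumptions: the ToyBlock class, the 100-byte fixed overhead, and the 105-byte rows are all hypothetical, and real blocks also track the row directory, ITL slots and more. It only shows why avsp, and therefore the transfer size, stays the same until the block is compacted.</p>

```python
from dataclasses import dataclass, field


@dataclass
class ToyBlock:
    """Hypothetical 8K block: row pieces are stacked from the end toward the header."""
    size: int = 8192
    header: int = 100  # assumed fixed overhead (header + row directory)
    rows: list = field(default_factory=list)  # each entry: [length, deleted]

    def insert(self, length: int) -> None:
        self.rows.append([length, False])

    def delete(self, index: int) -> None:
        self.rows[index][1] = True  # only set the "D" flag; space is not reclaimed

    def compact(self) -> None:
        # Defragmentation: physically drop deleted rows.
        self.rows = [row for row in self.rows if not row[1]]

    def avsp(self) -> int:
        # Deleted rows still occupy bytes until compact() runs.
        return self.size - self.header - sum(length for length, _ in self.rows)


block = ToyBlock()
for _ in range(70):
    block.insert(105)
print(block.avsp())  # 742
for i in range(1, 70):
    block.delete(i)
print(block.avsp())  # 742, unchanged after deleting 69 of 70 rows
block.compact()
print(block.avsp())  # 7987
```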
<p> I increased the value of the _gc_fusion_compression parameter to 4096, then to 8192, and repeated the tests. The behavior is confirmed: with the parameter set to 8192, transferring a block with just one row resulted in a packet size of 8336 bytes, meaning this optimization simply did not kick in (as the free space in the block can never be greater than 8192).
</p>
<p><b> !!!Warning!!! </b></p>
<p> Yes, with 0x6 exclamation marks! This note is meant to improve your understanding of cache fusion traffic, not to recommend that you change this parameter. This parameter is better left untouched.
</p>
<p> This is a very cool optimization feature, useful in data warehouse databases with a 32K block size. I am not sure in which version this optimization was introduced, though. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.orainternals.com/2012/04/29/_gc_fusion_compression/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Collaborate 2012 presentations</title>
		<link>http://www.orainternals.com/2012/04/29/collaborate-2012-presentations/</link>
		<comments>http://www.orainternals.com/2012/04/29/collaborate-2012-presentations/#comments</comments>
		<pubDate>Sun, 29 Apr 2012 17:21:42 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Oracle database internals]]></category>
		<category><![CDATA[Performance tuning]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[RAC]]></category>
		<category><![CDATA[dtruss]]></category>
		<category><![CDATA[pfiles]]></category>
		<category><![CDATA[pmap]]></category>
		<category><![CDATA[pstack]]></category>
		<category><![CDATA[strace]]></category>
		<category><![CDATA[truss]]></category>

		<guid isPermaLink="false">http://www.orainternals.com/?p=1380</guid>
		<description><![CDATA[&#160; 2012_326_Riyaj_scan_vip_haip_doc 2012_326_Riyaj_SCAN_VIP_HAIP_ppt 2012_327_Riyaj_pstack_truss_doc 2012_327_Riyaj_pstack_truss_etc]]></description>
			<content:encoded><![CDATA[
<p><a href="http://www.orainternals.com/wp-content/uploads/2012/04/2012_326_Riyaj_scan_vip_haip_doc.pdf">2012_326_Riyaj_scan_vip_haip_doc</a></p>
<p><a href="http://www.orainternals.com/wp-content/uploads/2012/04/2012_326_Riyaj_SCAN_VIP_HAIP_ppt.pdf">2012_326_Riyaj_SCAN_VIP_HAIP_ppt</a></p>
<p><a href="http://www.orainternals.com/wp-content/uploads/2012/04/2012_327_Riyaj_pstack_truss_doc.pdf">2012_327_Riyaj_pstack_truss_doc</a></p>
<p><a href="http://www.orainternals.com/wp-content/uploads/2012/04/2012_327_Riyaj_pstack_truss_etc.pdf">2012_327_Riyaj_pstack_truss_etc</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.orainternals.com/2012/04/29/collaborate-2012-presentations/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Temporary tablespace in RAC</title>
		<link>http://www.orainternals.com/2012/04/29/temporary-tablespace-in-rac/</link>
		<comments>http://www.orainternals.com/2012/04/29/temporary-tablespace-in-rac/#comments</comments>
		<pubDate>Sun, 29 Apr 2012 17:15:08 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Oracle database internals]]></category>
		<category><![CDATA[Performance tuning]]></category>
		<category><![CDATA[RAC]]></category>
		<category><![CDATA[CI enqueue]]></category>
		<category><![CDATA[DFS lock handle]]></category>
		<category><![CDATA[oracle performance]]></category>
		<category><![CDATA[RAC performance]]></category>
		<category><![CDATA[SS enqueue]]></category>
		<category><![CDATA[temporary tablepsace]]></category>

		<guid isPermaLink="false">http://www.orainternals.com/?p=1366</guid>
		<description><![CDATA[Temporary tablespaces are shared objects, and they are associated with a user or with the whole database (using the default temporary tablespace). So, in RAC, temporary tablespaces are shared between the instances. Many temporary tablespaces can be created in a database, but all of them are shared between the instances. Hence, temporary tablespaces must be allocated [...]]]></description>
			<content:encoded><![CDATA[<p>Temporary tablespaces are shared objects, and they are associated with a user or with the whole database (using the default temporary tablespace). So, in RAC, temporary tablespaces are shared between the instances. Many temporary tablespaces can be created in a database, but all of them are shared between the instances. Hence, temporary tablespaces must be allocated on shared storage or ASM. In this blog entry, we will explore space allocation in temporary tablespaces in RAC.</p>
<p>In contrast, UNDO tablespaces are owned by an instance, and all transactions from that instance are allocated exclusively in that UNDO tablespace. Remember that other instances can read blocks from a remote undo tablespace, so undo tablespaces must also be allocated on shared storage or ASM.</p>
<p><strong> Space allocation in TEMP tablespace </strong></p>
<p>TEMP tablespaces are divided into extents (in this 11.2 database, the extent size is 1M; the extent size of a locally managed temporary tablespace is set by the UNIFORM SIZE clause, which defaults to 1M). These extent maps are cached in the local SGA, essentially soft-reserving those extents for the sessions connecting to that instance. Note, however, that extents in a temporary tablespace are not cached at instance startup; instead, the instance caches extents as the need arises. We will explore this with a small example:</p>
<p><span id="more-1366"></span></p>
<p>This database has two instances and a TEMP tablespace. The TEMP tablespace has two temp files of 300M each.</p>
<pre>Listing 1-1: dba_temp_files

  1* select file_name, bytes/1024/1024 sz_in_mb from dba_temp_files
SYS@solrac1:1&gt;/

FILE_NAME                                                      SZ_IN_MB
------------------------------------------------------------ ----------
+DATA/solrac/tempfile/temp.266.731449235                            300
+DATA/solrac/tempfile/temp.448.775136163                            300</pre>
<p>Initially, no extents were cached and no extents were in use, as shown by the output of the gv$temp_extent_pool view in Listing 1-2.</p>
<pre>Listing 1-2: Initial view of temp extents

select inst_id, file_id, extents_cached, extents_used from gv$temp_extent_pool order by 1,2;

   INST_ID    FILE_ID EXTENTS_CACHED EXTENTS_USED
---------- ---------- -------------- ------------
         1          1              0            0
         1          2              0            0
         2          1              0            0
         2          2              0            0</pre>
<p>Now we are ready to start the test case.</p>
<pre>Listing 1-3: Script in execution

select inst_id, file_id, extents_cached, extents_used from gv$temp_extent_pool order by 1,2;
  INST_ID    FILE_ID EXTENTS_CACHED EXTENTS_USED
---------- ---------- -------------- ------------
         1          1              0            0
         1          2              0            0
         2          1             22           22
         2          2             23           23
...
/
   INST_ID    FILE_ID EXTENTS_CACHED EXTENTS_USED
---------- ---------- -------------- ------------
         1          1              0            0
         1          2              0            0
         2          1            108          108
         2          2            111          111</pre>
<p>I started a small SQL script that joins multiple tables with hash joins so as to induce disk-based sorting. After starting the script in instance 2, you can see that extents are cached and used in instance 2, as shown in Listing 1-3. Initially, 45 extents were in use; a few seconds later, temp tablespace usage grew to 219 extents.</p>
<pre>Listing 1-4: script completion

  INST_ID    FILE_ID EXTENTS_CACHED EXTENTS_USED
---------- ---------- -------------- ------------
         1          1              0            0
         1          2              0            0
         2          1            163            0
         2          2            166            0</pre>
<p>After the script completed, as shown in Listing 1-4, the extents_used column dropped to 0, but extents_cached stayed at the maximum usage level (163 + 166 = 329 extents). This means that extents are cached (soft-reserved) in an instance and not released until another instance asks for them, as we will see later.</p>
<p>You should also note that the extents are spread equally between the two files in that temporary tablespace. If you have more files in the temporary tablespace, the extents will be allocated uniformly across all of those temp files.</p>
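<p>Even spreading can be pictured as round-robin placement. The Python sketch below is only a model: the real placement is approximately, not exactly, even (Listing 1-4 shows 163 and 166 extents rather than 165 and 164), and the function name is hypothetical.</p>

```python
def spread_extents(n_extents: int, n_files: int) -> list:
    """Hypothetical round-robin placement of temp extents across temp files."""
    per_file = [0] * n_files
    for extent in range(n_extents):
        per_file[extent % n_files] += 1
    return per_file


print(spread_extents(329, 2))   # [165, 164]
print(spread_extents(329, 24))  # 17 files get 14 extents, 7 files get 13
```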
<p><strong> Space reallocation </strong></p>
<p>Even if the cached extents are free, they are not immediately available to other instances. The requesting instance must ask the owning instance to uncache the extents, and only then do those extents become available for use in the requesting instance. We will demonstrate this concept with the same test case, except that we will execute it in instance 1.</p>
<pre>Listing 1-5: script execution in instance #1

  INST_ID    FILE_ID EXTENTS_CACHED EXTENTS_USED
---------- ---------- -------------- ------------
         1          1             42           42
         1          2             42           42
         2          1            163            0
         2          2            166            0</pre>
<p>At the start of SQL execution, instance 1 started to reserve extents by caching them. My session was using those extents, as visible from gv$temp_extent_pool, and the number of extents used by instance 1 was slowly growing. See Listing 1-5.</p>
<pre>Listing 1-6: instance #1 stole the extents from instance #2

   INST_ID    FILE_ID EXTENTS_CACHED EXTENTS_USED
---------- ---------- -------------- ------------
         1          1            195           71
         1          2            133          116
         2          1             63            0 &lt;-- note here
         2          2            166            0</pre>
<p>It gets interesting. Notice that instance 2 still had 329 extents cached in Listing 1-5. Since my SQL script needs 329M of space in the temp tablespace, instance 1 needs to steal space from instance 2.</p>
<p>In Listing 1-6, instance 1 needed more extents, so instance 2 uncached 100 extents: its extents_cached column went down from 163 to 63 (third row in the output above). Essentially, in this example, instance 1 asked instance 2 to uncache extents, and instance 2 obliged by uncaching 100 extents. Prior to 11g, extents were uncached one extent per request. From 11g onwards, 100 extents are released for a single request, and all 100 extents are acquired by the requesting instance. Instance 1 acquired those 100 extents, cached them, and then the session continued to use those temp extents.</p>
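<p>The bookkeeping can be modeled in a few lines of Python. This is a simplified, hypothetical sketch: it tracks only cached-extent counts per instance for one temp file, assumes the owning instance's cached extents are all free (extents_used = 0, as for instance 2 in Listing 1-6), and ignores the enqueue and cross-instance call traffic that a real request generates.</p>

```python
import math


def steal_extents(cached: dict, owner: int, requester: int, batch: int = 100) -> int:
    """One uncache request: move up to batch cached extents from owner to requester.

    Assumes every extent cached by the owner is free (extents_used = 0).
    """
    moved = min(batch, cached[owner])
    cached[owner] -= moved
    cached[requester] += moved
    return moved


def requests_needed(extents: int, batch: int) -> int:
    """Number of uncache requests needed to move the given number of extents."""
    return math.ceil(extents / batch)


# Hypothetical starting point for one temp file: instance 2 holds 163 free
# cached extents (as in Listing 1-5); instance 1's count is set to 0 for simplicity.
cached = {1: 0, 2: 163}
steal_extents(cached, owner=2, requester=1)
print(cached)                     # {1: 100, 2: 63}
print(requests_needed(100, 1))    # 100 requests at one extent per request (pre-11g)
print(requests_needed(100, 100))  # 1 request at 100 extents per request (11g)
```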
<pre>Listing 1-7: script completion and node #1 has more extents cached.

  INST_ID    FILE_ID EXTENTS_CACHED EXTENTS_USED
---------- ---------- -------------- ------------
         1          1            195            0
         1          2            133            0
         2          1             63            0
         2          2            166            0</pre>
<p>After the script completed, instance 1 did not release the extents. Cached extents are not released (they remain soft-reserved) until another instance asks for them to be uncached.</p>
<p>I also enabled SQL trace in my session in instance 1 while executing the script. The SQL trace file spells out the details of un-reserving these extents.</p>
<pre>Listing 1-8: SQL Trace
...
#1: nam='enq: TS - contention' ela= 4172867 name|mode=1414725636 tablespace ID=3 dba=2 obj#=0 tim=6322835898
#2: nam='enq: SS - contention' ela= 608 name|mode=1397948417 tablespace #=3 dba=2 obj#=0 tim=6322837101
#3: nam='enq: SS - contention' ela= 414 name|mode=1397948422 tablespace #=3 dba=2 obj#=0 tim=6322837710
#4: nam='DFS lock handle' ela= 389 type|mode=1128857605 id1=14 id2=1 obj#=0 tim=6322838264
#5: nam='DFS lock handle' ela= 395 type|mode=1128857605 id1=14 id2=3 obj#=0 tim=6322838788
#6: nam='DFS lock handle' ela= 260414 type|mode=1128857605 id1=14 id2=2 obj#=0 tim=6323099335
...</pre>
<p>Line #1 above shows that a tablespace-level lock (TS enqueue) is taken on the TEMP tablespace (ID=3 is the ts# column in the sys.ts$ table). Then SS locks were acquired on that tablespace, first in mode 1 and then in mode 6 (lines #2 and #3). In lines #4 to #6, a Cross Instance Call (CIC) was used to ask the remote SMON process to un-reserve the cached extents, using CI-type locks and the <a href="http://orainternals.wordpress.com/2011/11/08/troubleshooting-dfs-lock-handle-waits/">DFS lock handle</a> mechanism with lock types CI-14-1, CI-14-2, and CI-14-3.</p>
<pre>Listing 1-9: Enqueue type

select chr(bitand(&amp;&amp;p1,-16777216)/16777215) || chr(bitand(&amp;&amp;p1,16711680)/65535) type,
mod(&amp;&amp;p1, 16) md from dual;
Enter value for p1: 1397948422

TY         MD
-- ----------
SS          6

Enter value for p1: 1128857605
TY         MD
-- ----------
CI          5</pre>
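<p>The same decoding can be done in Python, which is handy when there are many trace lines to process. This hypothetical helper mirrors the SQL above, but it reads the mode from the two low-order bytes instead of using mod 16; for the modes seen here (1 to 6), both give the same answer.</p>

```python
def decode_name_mode(p1: int) -> tuple:
    """Split a 'name|mode' wait parameter into (enqueue type, mode).

    The two high-order bytes hold the enqueue type as ASCII characters,
    and the two low-order bytes hold the requested mode.
    """
    raw = p1.to_bytes(4, "big")
    return raw[:2].decode("ascii"), int.from_bytes(raw[2:], "big")


# p1 values copied from the trace lines in Listing 1-8
for p1 in (1414725636, 1397948417, 1397948422, 1128857605):
    print(p1, decode_name_mode(p1))
# 1414725636 ('TS', 4)
# 1397948417 ('SS', 1)
# 1397948422 ('SS', 6)
# 1128857605 ('CI', 5)
```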
<p>From Listing 1-8, approximately 4.5 seconds were spent moving the cached extents from one instance to the other. Prior to 11g, this test case would run much longer, since extents were uncached one extent per request. Hundreds of such requests would trigger a tsunami of SS and CI enqueue requests, leading to massive application performance issues. In 11g, Oracle Development resolved this issue by uncaching 100 extents per request.</p>
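<p>The elapsed time can be tallied from the ela values in Listing 1-8, which are in microseconds. A hypothetical Python helper summing the six visible lines (the elided "..." lines are not included) gives about 4.4 seconds, most of it in the TS enqueue wait and the last DFS lock handle wait.</p>

```python
import re

# The six wait lines shown in Listing 1-8 (the elided "..." lines are not included).
TRACE = """
#1: nam='enq: TS - contention' ela= 4172867 name|mode=1414725636 tablespace ID=3 dba=2 obj#=0 tim=6322835898
#2: nam='enq: SS - contention' ela= 608 name|mode=1397948417 tablespace #=3 dba=2 obj#=0 tim=6322837101
#3: nam='enq: SS - contention' ela= 414 name|mode=1397948422 tablespace #=3 dba=2 obj#=0 tim=6322837710
#4: nam='DFS lock handle' ela= 389 type|mode=1128857605 id1=14 id2=1 obj#=0 tim=6322838264
#5: nam='DFS lock handle' ela= 395 type|mode=1128857605 id1=14 id2=3 obj#=0 tim=6322838788
#6: nam='DFS lock handle' ela= 260414 type|mode=1128857605 id1=14 id2=2 obj#=0 tim=6323099335
"""


def total_wait_seconds(trace: str) -> float:
    """Sum the ela= values (microseconds) of all wait lines and convert to seconds."""
    return sum(int(us) for us in re.findall(r"ela= (\d+)", trace)) / 1_000_000


print(total_wait_seconds(TRACE))  # 4.435087
```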
<p><strong> Important points to note </strong></p>
<ol>
<li>As you can see, extents are allocated from all temporary files uniformly. The file header block is also modified during this operation. This is one of the reasons to create many temporary files in RAC. The recommendation is to create as many temp files as the number of instances. If you have 24 nodes in your RAC cluster, yes, that implies creating 24 temp files in the TEMP tablespace.</li>
<li>As we saw in the locking contention output of our test case, having more temporary tablespaces might help alleviate SS enqueue contention, since SS locks are at the tablespace level. Essentially, more temporary tablespaces mean more SS enqueues. But you will move the contention from SS locks to &#8216;DFS lock handle&#8217; waits, as the Cross Instance Call for the extent un-caching operation is one per instance.</li>
<li>Temporary tablespace groups are of no use, since the contention will be at the Cross Instance Call. In fact, temporary tablespace groups can potentially cause more issues, since free space in one temp tablespace cannot be reassigned dynamically to another temp tablespace, even if both are in the same tablespace group. In theory, temp tablespace groups can lead to more SS and CI locking contention.</li>
<li>Probably a good approach is to assign different temporary tablespaces to OLTP users and DSS users and to affinitize the workload to specific instances.</li>
</ol>
<p>Update 1: Remember that you need to understand your application workload before following my advice.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.orainternals.com/2012/04/29/temporary-tablespace-in-rac/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>gc buffer busy acquire vs release</title>
		<link>http://www.orainternals.com/2012/04/29/gc-buffer-busy-acquire-vs-release/</link>
		<comments>http://www.orainternals.com/2012/04/29/gc-buffer-busy-acquire-vs-release/#comments</comments>
		<pubDate>Sun, 29 Apr 2012 17:13:07 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[11g]]></category>
		<category><![CDATA[Oracle database internals]]></category>
		<category><![CDATA[Performance tuning]]></category>
		<category><![CDATA[RAC]]></category>
		<category><![CDATA[gc buffer busy]]></category>
		<category><![CDATA[gc buffer busy acquire]]></category>
		<category><![CDATA[gc buffer busy release]]></category>
		<category><![CDATA[RAC performance]]></category>

		<guid isPermaLink="false">http://www.orainternals.com/?p=1375</guid>
		<description><![CDATA[Last week (March 2012), I was conducting Advanced RAC Training online. During the class, I was recreating &#8216;gc buffer busy&#8217; waits to explain the concepts and methods to troubleshoot the issue. Definitions Let&#8217;s define these events first. The &#8216;gc buffer busy&#8217; event means that a session is trying to access a buffer, but there is [...]]]></description>
			<content:encoded><![CDATA[<p>
Last week (March 2012), I was conducting Advanced RAC Training online. During the class, I was recreating &#8216;gc buffer busy&#8217; waits to explain the concepts and methods to troubleshoot the issue.
</p>
<p><b> Definitions </b></p>
<p>
Let&#8217;s define these events first. The &#8216;gc buffer busy&#8217; event means that a session is trying to access a buffer, but there is already an open global cache (GC) lock request for that block, so the session must wait for that GC lock request to complete before proceeding. This wait is instrumented as the &#8216;gc buffer busy&#8217; event.
</p>
<p>
From 11g onwards, this wait event is split into &#8216;gc buffer busy acquire&#8217; and &#8216;gc buffer busy release&#8217;. An attendee asked me to show the difference between these two wait events. Fortunately, we had a problem with LGWR writes, so we were able to inspect the waits clearly during the class.
</p>
<p>
Remember that global cache enqueues are considered to be owned by an instance. From 11g onwards, the gc buffer busy event differentiates between two cases: </p>
<ol>
<li> If the existing open GC request originated from the local instance, then the current session waits for &#8216;gc buffer busy acquire&#8217;. Essentially, the current process is waiting for another process in the local instance to acquire the GC lock on behalf of the local instance. Once the GC lock is acquired, the current process can access that buffer without additional GC processing (if the lock is acquired in a compatible mode). </li>
<li> If the existing open GC request originated from a remote instance, then the current session waits for the &#8216;gc buffer busy release&#8217; event. In this case, the session is waiting for another remote session (hence another instance) to release the GC lock, so that the local instance can acquire the buffer.
     </li>
</ol>
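<p>The two cases reduce to a simple rule, sketched here as a hypothetical Python function. It only restates the rule above; it is not how Oracle implements it.</p>

```python
def gc_buffer_busy_event(requesting_inst: int, pending_request_inst: int) -> str:
    """Which 11g wait a session posts when a GC request for the block is already open.

    requesting_inst:      instance of the session that wants the buffer
    pending_request_inst: instance where the open GC request originated
    """
    if pending_request_inst == requesting_inst:
        return "gc buffer busy acquire"  # wait for a local process to acquire the lock
    return "gc buffer busy release"      # wait for the remote instance to release it


print(gc_buffer_busy_event(1, 1))  # gc buffer busy acquire
print(gc_buffer_busy_event(2, 1))  # gc buffer busy release
```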
<p><b> Example </b></p>
<p> The following output shows the difference clearly.
</p>
<p><span id="more-1375"></span></p>
<p> Notice that SID 53 in instance 1 has an open GC request for file #10, block #560651 (line #1 in the output), and the session is waiting for &#8216;gc current request&#8217; (which is a placeholder event, by the way). All processes requesting access to this block in instance 1 wait for &#8216;gc buffer busy acquire&#8217;. Similarly, all processes waiting for access to the block in instance 2 wait for &#8216;gc buffer busy release&#8217;. Essentially, instance 1 sessions are waiting for the local instance to acquire the GC lock, and instance 2 sessions are waiting for instance 1 to release the GC lock. Of course, LGWR is completely stuck in this case, so the global cache layer is also nearly frozen.
</p>
<pre>
INST_ID    SID EVENT                   USERNAME   STATE    WIS P1_P2_P3_TEXT
------- ------ ----------------------- ---------- -------- -------------------------------
      1     53 gc current request      SYS        WAITING  26 file# 10-block# 560651-id# 16777217
      1     40 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     60 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     59 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     58 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     56 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     55 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     54 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     53 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     48 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      2      1 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     68 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     65 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     64 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     69 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     57 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     43 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     36 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     47 log file sync           SYS        WAITING  22 buffer# 4450-sync scn 30839721- 0
</pre>
<p>
 In summary, this differentiation is useful. In most cases, &#8216;gc buffer busy&#8217; is a symptom, so in this example I would review instance 1 closely, since the waits in that instance are &#8216;gc buffer busy acquire&#8217;, and most probably I would quickly start diagnosing the session with sid=53 @inst=1.</p>
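<p>With many sessions involved, it helps to summarize such output by instance and event. The hypothetical Python helper below is fed the rows from the listing above; in practice you would feed it your own query output. It assumes every row has USERNAME SYS, as in this listing.</p>

```python
import re
from collections import Counter

# Data rows copied from the listing above.
LISTING = """
      1     53 gc current request      SYS        WAITING  26 file# 10-block# 560651-id# 16777217
      1     40 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     60 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     59 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     58 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     56 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     55 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     54 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     53 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      1     48 gc buffer busy acquire  SYS        WAITING  file# 10-block# 560651-class# 1
      2      1 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     68 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     65 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     64 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     69 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     57 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     43 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     36 gc buffer busy release  SYS        WAITING  file# 10-block# 560651-class# 1
      2     47 log file sync           SYS        WAITING  22 buffer# 4450-sync scn 30839721- 0
"""

# INST_ID, SID, then the event name up to the USERNAME column.
ROW = re.compile(r"\s*(\d+)\s+(\d+)\s+(.+?)\s+SYS\s")


def waits_by_instance(listing: str) -> Counter:
    """Count sessions per (instance, wait event)."""
    counts = Counter()
    for line in listing.splitlines():
        match = ROW.match(line)
        if match:
            counts[(int(match.group(1)), match.group(3))] += 1
    return counts


counts = waits_by_instance(LISTING)
for (inst, event), n in sorted(counts.items()):
    print(inst, event, n)
# 1 gc buffer busy acquire 9
# 1 gc current request 1
# 2 gc buffer busy release 8
# 2 log file sync 1

# Sessions waiting on "acquire" point at the instance holding the open request.
print(sorted({inst for inst, event in counts if event == "gc buffer busy acquire"}))  # [1]
```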
]]></content:encoded>
			<wfw:commentRss>http://www.orainternals.com/2012/04/29/gc-buffer-busy-acquire-vs-release/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What is &#8216;rdbms ipc message&#8217; wait event?</title>
		<link>http://www.orainternals.com/2012/02/13/what-is-rdbms-ipc-message-wait-event/</link>
		<comments>http://www.orainternals.com/2012/02/13/what-is-rdbms-ipc-message-wait-event/#comments</comments>
		<pubDate>Mon, 13 Feb 2012 15:41:25 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Oracle database internals]]></category>
		<category><![CDATA[Performance tuning]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[LGWR]]></category>
		<category><![CDATA[oracle performance]]></category>
		<category><![CDATA[rdbms ipc message]]></category>
		<category><![CDATA[semtimedop]]></category>
		<category><![CDATA[truss]]></category>

		<guid isPermaLink="false">http://www.orainternals.com/?p=1356</guid>
		<description><![CDATA[Introduction There was a question about the wait event &#8216;rdbms ipc message&#8217; on the Oracle-l list. The short answer is that the &#8216;rdbms ipc message&#8217; event means that a process is waiting for an IPC message to arrive. Usually, this wait event can be ignored, but there are a few rare scenarios where it can&#8217;t be completely ignored. [...]]]></description>
			<content:encoded><![CDATA[<p><strong> Introduction </strong></p>
<p>There was a question about the wait event &#8216;rdbms ipc message&#8217; on the Oracle-l list. The short answer is that the &#8216;rdbms ipc message&#8217; event means that a process is waiting for an IPC message to arrive. Usually, this wait event can be ignored, but there are a few rare scenarios where it can&#8217;t be completely ignored.</p>
<p><strong> What does the &#8216;rdbms ipc message&#8217; wait mean? </strong></p>
<p>It is typical of Oracle Database background processes to wait for more work. For example, LGWR waits for more work until another (foreground or background) process requests LGWR to do a log flush. On UNIX platforms, the wait mechanism is implemented as a sleep on a specific semaphore associated with that process. This wait time is accounted to the database wait event &#8216;rdbms ipc message&#8217;.</p>
<p>Also note that semaphore-based waits are used in other wait scenarios too, not just for &#8216;rdbms ipc message&#8217; waits.</p>
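<p>To make the mechanism concrete, here is a rough Python analogy. It is an assumption-laden sketch, not Oracle code: threading.Semaphore stands in for the process's semaphore, acquire(timeout=...) stands in for semtimedop(), and the 3-second timeout is shortened to 0.2 seconds so the example finishes quickly. Each timed-out acquire corresponds to a full-length &#8216;rdbms ipc message&#8217; wait, while a release() from another thread ends the wait early and is followed by work, much like the log write seen in the trace below.</p>

```python
import threading
import time


def background_loop(sem: threading.Semaphore, stop: threading.Event,
                    events: list, timeout: float = 3.0) -> None:
    """Hypothetical LGWR-like idle loop: sleep on a semaphore with a timeout."""
    while not stop.is_set():
        start = time.monotonic()
        posted = sem.acquire(timeout=timeout)  # plays the role of semtimedop()
        events.append(("rdbms ipc message", round(time.monotonic() - start, 2)))
        if posted:  # another thread posted the semaphore: do some work
            events.append(("log file parallel write", None))


sem = threading.Semaphore(0)
stop = threading.Event()
events = []
worker = threading.Thread(target=background_loop, args=(sem, stop, events, 0.2))
worker.start()
time.sleep(0.5)   # roughly two full 0.2 s timeouts, then part of a third wait
sem.release()     # a "foreground" thread posts the semaphore
time.sleep(0.05)
stop.set()
worker.join()     # the loop exits after its current wait times out
for entry in events:
    print(entry)
```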
<p><strong>Time to Trace</strong></p>
<p>We will use the UNIX utility truss to trace system calls from LGWR, and we will enable SQL trace on the LGWR process. Using the output of these two methods, we will explore this wait event.</p>
<pre>
-- First we will identify UNIX PID of LGWR
$ ps -ef|grep ora_lgwr_${ORACLE_SID}
  oracle  1508     1   0 13:27:33 ?           0:00 ora_lgwr_solrac1

-- Next we will map that to Oracle session sid.
select sid,b.serial#,b.program,b.username from v$process a,v$session b
where a.addr=b.paddr
and a.spid=&amp;proc_id
/
Enter value for proc_id: 1508
  SID    SERIAL# PROGRAM                                          USERNAME
----- ---------- ------------------------------------------------ ------------------------------
   19          1 oracle@solrac1 (LGWR)

-- Let's enable sqltrace on LGWR session

SYS@solrac1:1&gt;EXEC DBMS_MONITOR.session_trace_enable(session_id =&gt;19, serial_num=&gt;1, waits=&gt;TRUE, binds=&gt;FALSE);

PL/SQL procedure successfully completed.
</pre>
<p>Note: enabling SQL trace on LGWR is not a great idea in a production environment, or in any environment where performance is vital, so try this test case only at home. The section below prints a few lines from the trace file of that LGWR process.</p>
<pre>*** 2012-02-10 13:42:31.333
WAIT #0: nam='rdbms ipc message' ela= 3000112 timeout=300 p2=0 p3=0 obj#=-1 tim=1269283430

*** 2012-02-10 13:42:34.333
WAIT #0: nam='rdbms ipc message' ela= 3000349 timeout=300 p2=0 p3=0 obj#=-1 tim=1272283936 -- line #1

*** 2012-02-10 13:42:37.334
WAIT #0: nam='rdbms ipc message' ela= 3000397 timeout=300 p2=0 p3=0 obj#=-1 tim=1275284461 -- line #2

*** 2012-02-10 13:42:40.334
WAIT #0: nam='rdbms ipc message' ela= 3000185 timeout=300 p2=0 p3=0 obj#=-1 tim=1278284833 -- line #3

*** 2012-02-10 13:42:41.157
WAIT #0: nam='rdbms ipc message' ela= 822820 timeout=300 p2=0 p3=0 obj#=-1 tim=1279107848 -- line #4
WAIT #0: nam='log file parallel write' ela= 604 files=2 blocks=2 requests=2 obj#=-1 tim=1279108889 -- line #5</pre>
<p>I tagged the lines with comments to improve readability, and I refer to those line numbers in the points below. There are a few important points to understand:</p>
<ol>
<li>As you may know, these trace lines are printed <em>after</em> the wait completes. The ela value is in microseconds and the timeout parameter is in centiseconds, so ela= 3000349 with timeout=300 is a full 3-second sleep. In Line #1, the LGWR process completed a 3-second wait and woke up at 2012-02-10 13:42:34.333. Essentially, LGWR slept for 3&nbsp;seconds between 13:42:31.333 and 13:42:34.333.</li>
<li>In Line #2 and Line #3, LGWR again slept for the full three seconds.</li>
<li>In Line #4, however, the elapsed time of &#8216;rdbms ipc message&#8217; is only 0.8 seconds, meaning LGWR slept for just 0.8 seconds, not the full three seconds. Line #5 shows that LGWR then wrote redo blocks to the log files.</li>
</ol>
<p>In a nutshell, LGWR sleeps on &#8216;rdbms ipc message&#8217; for the full 3 seconds if there is no work to be done. Another process can wake up LGWR and trigger a log write, as shown in Line #4 and Line #5. At this point, I will disable SQL trace on LGWR and truss the LGWR process instead.</p>
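<p>To make the arithmetic explicit, here is a minimal Python sketch (not part of the original test case) that parses WAIT lines like the ones above. It assumes, as in 10g and later trace files, that ela is in microseconds and that the timeout parameter of &#8216;rdbms ipc message&#8217; is in centiseconds. A wait that ends well before its timeout suggests that another process posted LGWR. The function name classify_wait and the 5% tolerance are hypothetical choices made for illustration.</p>

```python
import re

# Matches the wait name, elapsed time and timeout fields of a WAIT line.
WAIT_RE = re.compile(r"nam='([^']+)' ela= (\d+) timeout=(\d+)")


def classify_wait(line, tolerance=0.05):
    """Hypothetical helper: return (event, elapsed_s, timeout_s, woken_early).

    Assumes ela is in microseconds and timeout is in centiseconds.
    tolerance is an arbitrary fraction of the timeout used to decide
    whether the sleep ended early (posted) or ran until the timer expired.
    Returns None for lines without a timeout parameter.
    """
    m = WAIT_RE.search(line)
    if m is None:
        return None
    event = m.group(1)
    elapsed = int(m.group(2)) / 1_000_000   # microseconds -> seconds
    timeout = int(m.group(3)) / 100         # centiseconds -> seconds
    woken_early = elapsed < timeout * (1 - tolerance)
    return event, elapsed, timeout, woken_early


trace = [
    "WAIT #0: nam='rdbms ipc message' ela= 3000349 timeout=300 p2=0 p3=0 obj#=-1 tim=1272283936",
    "WAIT #0: nam='rdbms ipc message' ela= 822820 timeout=300 p2=0 p3=0 obj#=-1 tim=1279107848",
    "WAIT #0: nam='log file parallel write' ela= 604 files=2 blocks=2 requests=2 obj#=-1 tim=1279108889",
]

for line in trace:
    result = classify_wait(line)
    if result is not None:
        event, elapsed, timeout, early = result
        status = "posted early" if early else "timer expired"
        print(f"{event}: slept {elapsed:.3f}s of {timeout:.0f}s ({status})")
```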
<p><strong>LGWR truss</strong></p>
<p>Let&#8217;s truss the LGWR process and match the pattern with the trace file. The truss command shown here works on Solaris/HP platforms. On Linux, you would use strace -ttT -p 1508 to trace the process.</p>
<pre>$ truss -d -E -p 1508
Base time stamp:  1328904448.3033  [ Fri Feb 10 14:07:28 CST 2012 ]
....
/1:     12.2573  0.0000 times(0xFFFFFD7FFFDFD790)                       = 276585
/1:     semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) (sleeping...)
/1:     15.2576  0.0000 semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) Err#11 EAGAIN
/1:     15.2578  0.0000 times(0xFFFFFD7FFFDFD790)                       = 276885
/1:     15.2578  0.0000 times(0xFFFFFD7FFFDFD790)                       = 276885
/1:     semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) (sleeping...)
/1:     18.2582  0.0000 semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) Err#11 EAGAIN
/1:     18.2586  0.0000 times(0xFFFFFD7FFFDFD790)                       = 277185
/1:     18.2587  0.0000 times(0xFFFFFD7FFFDFD790)                       = 277185
/1:     semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) (sleeping...)
/1:     21.2589  0.0000 semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) Err#11 EAGAIN
/1:     21.2590  0.0000 times(0xFFFFFD7FFFDFD790)                       = 277485
/1:     21.2591  0.0000 times(0xFFFFFD7FFFDFD790)                       = 277485
/1:     semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) (sleeping...)
/1:     23.8895  0.0000 semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) = 0
/1:     23.8898  0.0001 kaio(AIOWRITE, 261, 0x6023D000, 6144, 0xFC73DB480CADD400) = 0
/1:     23.8903  0.0000 kaio(AIOWRITE, 263, 0x6023D000, 6144, 0xFC73DDD00C7DD400) = 0
/1:     23.8908  0.0001 kaio(AIOWAIT, 0xFFFFFFFFFFFFFFFF)               = 0
/1:     23.8917  0.0001 kaio(AIOWAIT, 0xFFFFFD7FFFDFC250)               = -2748838584880
/1:     23.8919  0.0000 kaio(AIOWAIT, 0xFFFFFD7FFFDFC250)               = -2748838585528</pre>
<p>The truss output above helps us understand LGWR behavior better. The second column prints the time offset from the base timestamp.</p>
<ol>
<li>At time offset 12.2573, LGWR went to sleep by calling semtimedop. I will explain the semtimedop system call shortly.</li>
<li>At time offset 15.2576, LGWR woke up from the sleep. That is, the semtimedop call returned with EAGAIN after its three-second timer expired.</li>
<li>LGWR immediately went to sleep again for three seconds (possibly because there was no work to be done) and woke up at time offset 18.2582. A similar 3-second timer expiry occurred at time offset 21.2589.</li>
<li>It gets interesting at time offset 23.8895. LGWR was (rudely!) woken up without being allowed to sleep for the full three seconds. It slept for only about 2.6 seconds (23.8895 - 21.2591), and semtimedop returned 0 instead of EAGAIN. The kaio calls that follow (kaio is kernelized asynchronous I/O) show LGWR submitting asynchronous writes to the redo log files.</li>
</ol>
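<p>The sleep durations quoted above can be derived directly from the truss offsets. The Python sketch below is an illustration rather than a tool from this post. It treats the offset of the last completed call before a sleep (a times() call here) as the start of that sleep, takes the offset at which semtimedop returned as the end, and reports whether the call ended with EAGAIN (timer expiry) or 0 (posted). It assumes the truss -d -E line layout shown above and skips the unfinished "(sleeping...)" lines. The function name is hypothetical.</p>

```python
import re

# One completed system call in truss -d -E output, e.g.
# "/1:     15.2576  0.0000 semtimedop(7, ...) Err#11 EAGAIN"
LINE_RE = re.compile(r"^/\d+:\s+(\d+\.\d+)\s+\d+\.\d+\s+(\w+)\(.*?\)\s+(.*)$")


def semtimedop_sleeps(lines):
    """Hypothetical helper: yield (start, end, seconds, outcome) per sleep.

    start is the offset of the previous completed system call and end is
    the offset at which semtimedop returned. outcome is "timeout" for
    Err#11 EAGAIN and "posted" for a 0 return code.
    """
    last_offset = None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue  # "(sleeping...)" lines have no offset yet
        offset, call, result = float(m.group(1)), m.group(2), m.group(3)
        if call == "semtimedop" and last_offset is not None:
            outcome = "timeout" if "EAGAIN" in result else "posted"
            yield last_offset, offset, round(offset - last_offset, 4), outcome
        last_offset = offset


# A few of the lines from the LGWR truss output above.
sample = """\
/1:     12.2573  0.0000 times(0xFFFFFD7FFFDFD790)                       = 276585
/1:     semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) (sleeping...)
/1:     15.2576  0.0000 semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) Err#11 EAGAIN
/1:     21.2591  0.0000 times(0xFFFFFD7FFFDFD790)                       = 277485
/1:     semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) (sleeping...)
/1:     23.8895  0.0000 semtimedop(7, 0xFFFFFD7FFFDFD178, 1, 0xFFFFFD7FFFDFD190) = 0
""".splitlines()

for start, end, seconds, outcome in semtimedop_sleeps(sample):
    print(f"{start:8.4f} -> {end:8.4f}: slept {seconds:.4f}s ({outcome})")
```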
<p><strong>semtimedop</strong></p>
<p>Oracle processes use the UNIX system call semtimedop to sleep with a timer. In this example, LGWR called semtimedop with a 3-second timeout, which suspends the process. The kernel will schedule the process (LGWR in this example) on a CPU again when one of two conditions occurs: (a) the timer requested by the process expires, in which case semtimedop returns EAGAIN, or (b) another process modifies this specific semaphore, in which case semtimedop returns 0.</p>
<p>In our example, if there is no work, the LGWR process sleeps for the full 3 seconds until the semaphore timer expires. If there is work to be done before the 3-second timer expires, another process modifies the semaphore associated with LGWR, so the kernel wakes up LGWR and schedules it on a CPU. LGWR then performs a log flush.</p>
<p>The advantage of sleeping on a semaphore with a timer is that the process does not consume any CPU while it is suspended.</p>
<p>This semaphore-based sleep is analogous to a parent sleeping in on a Saturday morning. The parent wakes up when the alarm goes off at 7:30 AM, or earlier if the kids wake up before the alarm <img src='http://www.orainternals.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> </p>
<p>As you can see, while the LGWR process is sleeping on its semaphore, the time is accounted to the wait event &#8216;rdbms ipc message&#8217;. Remember also that semaphores are created as semaphore sets (the first argument of semtimedop, 7 in the truss output above, is the semaphore set identifier), and the semctl system call allows the code to modify a specific semaphore within a semaphore set.</p>
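<p>The semaphore-with-timer behavior can be mimicked in user space. The Python sketch below is only an analogy. It uses threading.Semaphore.acquire(timeout=...) in place of the real semtimedop system call, and it shortens the timer from 3 seconds to 0.3 seconds so the example runs quickly. A timed-out acquire returns False, which plays the role of EAGAIN. An acquire satisfied by another thread&#8217;s release() returns True, which plays the role of the zero return code. The names lgwr_like_sleeper and post are hypothetical.</p>

```python
import threading
import time


def lgwr_like_sleeper(sem, timeout, rounds):
    """Hypothetical sleeper: sleep on sem up to `rounds` times.

    Returns a list of (outcome, seconds_slept) tuples, where outcome is
    "timeout" (analogous to semtimedop returning EAGAIN) or "posted"
    (analogous to a 0 return after another process modified the semaphore).
    """
    history = []
    for _ in range(rounds):
        start = time.monotonic()
        posted = sem.acquire(timeout=timeout)  # blocks until released or timed out
        slept = time.monotonic() - start
        history.append(("posted" if posted else "timeout", slept))
    return history


def post(sem, delay):
    """Hypothetical poster: wake the sleeper after `delay` seconds."""
    time.sleep(delay)
    sem.release()


if __name__ == "__main__":
    sem = threading.Semaphore(0)  # starts at 0, so acquire() has to wait
    # Post once, partway through the third 0.3-second sleep.
    threading.Thread(target=post, args=(sem, 0.75)).start()
    for outcome, slept in lgwr_like_sleeper(sem, timeout=0.3, rounds=3):
        print(f"{outcome:8s} after {slept:.2f}s")
```

<p>As with the real semaphore, the sleeping thread is blocked rather than spinning, so it uses no CPU while it waits.</p>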
<p><strong>semtimedop is used in numerous places</strong></p>
<p>The wait event &#8216;rdbms ipc message&#8217; is not the only wait event associated with the semtimedop system call, because semtimedop is used in numerous places. Let&#8217;s examine row-level lock contention between two processes.</p>
<pre>-- In session #1, we update a row.
SYS@solrac1:1&gt;@mypid

SPID
------------------------
3100
SYS@solrac1:1&gt;update rs.t_one set v1=v1 where n1=100;

1 row updated.
----------------
-- In session #2, we find our PID and then try updating the same row, leading to lock contention.
----------------
SYS@solrac1:1&gt;@mypid

SPID
------------------------
3048

SYS@solrac1:1&gt;update rs.t_one set v1=v1 where n1=100;</pre>
<p>At this point, PID 3048 is waiting for a lock. We will truss PID 3048 to explore this concept further.</p>
<pre>
-- In another UNIX window
$ truss -d -E -p 3048
...
11.4971  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
11.9974  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
12.4979  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
12.9982  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
12.9984  0.0000 times(0xFFFFFD7FFFDF6F50)                       = 613374
12.9985  0.0000 times(0xFFFFFD7FFFDF6F50)                       = 613374
13.4987  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
13.9992  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
14.4995  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
15.0008  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
15.0010  0.0000 times(0xFFFFFD7FFFDF6F50)                       = 613575
15.0010  0.0000 times(0xFFFFFD7FFFDF6F50)                       = 613575
15.5012  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
16.0016  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
16.5023  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
17.0030  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) Err#11 EAGAIN
17.0032  0.0000 times(0xFFFFFD7FFFDF6F50)                       = 613775
17.0032  0.0000 times(0xFFFFFD7FFFDF6F50)                       = 613775
17.2466  0.0000 semtimedop(7, 0xFFFFFD7FFFDF69E8, 1, 0xFFFFFD7FFFDF6A00) = 0</pre>
<p>Under the covers, PID 3048 is waiting on a semaphore (its own semaphore) in a loop with a 0.5-second timer. The process sleeps on the semaphore, wakes up when the 0.5-second timer expires, and goes back to sleep. At this point, I rolled back the change in session #1 with a rollback command (a commit behaves exactly the same way). The process associated with session #1 identified the next waiter in the wait queue and woke it up by modifying the waiting process&#8217;s semaphore, at time offset 17.2466. You can see that the last semtimedop call slept for just 0.24 seconds (17.2466 - 17.0032) and returned 0. If the timer expires, semtimedop returns EAGAIN; otherwise, it returns 0. Remember that, in this context, EAGAIN is not a real error, just a timer expiry.</p>
<p>But in the case of enqueue contention, even though semaphore-based sleeps were used, the time is accounted to the wait event &#8216;enq: TX - row lock contention&#8217;. As you can see, semaphore-based sleeps are used for many different wait events.</p>
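<p>One way to picture this waiter is a loop that re-arms a short timer until it is posted, with the elapsed time measured once from the start of the whole loop. The Python sketch below is hypothetical and is not Oracle&#8217;s code. It uses threading.Semaphore as a stand-in for the process semaphore, shortens the 0.5-second timer to 0.05 seconds, and uses a hypothetical wait_for_post helper that returns the number of timer expiries together with the total elapsed time.</p>

```python
import threading
import time


def wait_for_post(sem, timer):
    """Hypothetical waiter: loop on short timed sleeps until sem is posted.

    Returns (timer_expiries, total_elapsed_seconds). The elapsed time is
    measured once, from the start of the first sleep, so many separate
    semaphore sleeps add up to a single wait.
    """
    expiries = 0
    start = time.monotonic()
    while not sem.acquire(timeout=timer):
        expiries += 1  # timer expired (like EAGAIN); loop and sleep again
    return expiries, time.monotonic() - start


if __name__ == "__main__":
    sem = threading.Semaphore(0)
    # Hypothetical "rollback" in session #1: post the waiter after ~0.22 s.
    threading.Timer(0.22, sem.release).start()
    expiries, elapsed = wait_for_post(sem, timer=0.05)
    print(f"posted after {expiries} timer expiries, {elapsed:.2f}s in one wait")
```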
<pre>
nam='enq: TX - row lock contention' ela= 13930545 name|mode=1415053318 usn<<16
</pre>
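<p>The name|mode parameter packs the two-character enqueue name and the requested lock mode into a single number. Here is a minimal Python sketch that assumes the usual layout, with the name characters in the two high-order bytes and the mode in the low-order 16 bits. Decoding the value shown above (0x54580006) gives TX requested in mode 6 (exclusive). The function name is hypothetical.</p>

```python
def decode_name_mode(value):
    """Hypothetical helper: split an enqueue 'name|mode' value into (name, mode).

    Assumes the two ASCII characters of the enqueue name are stored in the
    two high-order bytes and the lock mode in the low-order 16 bits.
    """
    name = chr((value >> 24) & 0xFF) + chr((value >> 16) & 0xFF)
    mode = value & 0xFFFF
    return name, mode


print(decode_name_mode(1415053318))  # ('TX', 6)
```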
<p>mypid.sql:</p>
<pre>
select spid from v$process where
  addr = (select  paddr from v$session where sid=(select sid from v$mystat where rownum =1))
;
</pre>
<p><strong>Summary</strong></p>
<p>
In summary, this wait event can be ignored. In rare cases, such as a platform bug in the 7.0 database version, semtimedop would not return even if another process modified the semaphore. Generally, though, this is an idle event and, as such, should be ignored.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.orainternals.com/2012/02/13/what-is-rdbms-ipc-message-wait-event/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Nologging redo size</title>
		<link>http://www.orainternals.com/2012/01/26/nologging-redo-size/</link>
		<comments>http://www.orainternals.com/2012/01/26/nologging-redo-size/#comments</comments>
		<pubDate>Wed, 25 Jan 2012 19:05:24 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[11g]]></category>
		<category><![CDATA[Oracle database internals]]></category>
		<category><![CDATA[Performance tuning]]></category>
		<category><![CDATA[force logging redo size]]></category>
		<category><![CDATA[oracle performance]]></category>
		<category><![CDATA[redo size]]></category>
		<category><![CDATA[redo size script]]></category>

		<guid isPermaLink="false">http://www.orainternals.com/?p=1316</guid>
		<description><![CDATA[It is probably easy to calculate hourly redo rate or daily redo rate using AWR data. For example, my script awr_redo_size.sql can be used to calculate daily redo rate, and awr_redo_size_history.sql can be used to calculate hourly redo rate. Hourly redo rate is especially useful since you can export to an excel spreadsheet, graph it [...]]]></description>
			<content:encoded><![CDATA[<p>It is probably easy to calculate hourly redo rate or daily redo rate using AWR data. For example, my script <a href="http://www.orainternals.com/wp-content/uploads/2011/12/awr_redo_size_sql.txt">awr_redo_size.sql</a> can be used to calculate daily redo rate, and <a href="http://www.orainternals.com/wp-content/uploads/2011/12/awr_redo_size_history_sql1.txt">awr_redo_size_history.sql</a> can be used to calculate hourly redo rate. Hourly redo rate is especially useful since you can export to an excel spreadsheet, graph it to see redo rate trend.</p>
<p><strong>Introduction to Direct Mode Writes</strong></p>
<p>Direct mode operations write directly into the database files, bypassing the buffer cache. If the database is <em>not</em> in force logging mode, only minimal redo (aka invalidation redo) is generated. Keeping the database out of force logging mode is peachy as long as you don&#8217;t use Data Guard, Streams, or GoldenGate.</p>
<p>Then, suddenly, the business decides to use one of these log-mining-based replication products. This means that you must turn on <a href="http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10726/configbp.htm#i1013555">Force logging</a> at the database level so that the replication tools can capture (or, in the case of Data Guard, simply replay) the redo information correctly and consistently.</p>
<p>But what if your application performs a high volume of direct mode operations, such as insert /*+ append */ operations? You now need to estimate the redo size to understand the effect of FORCE LOGGING mode, and that estimation gets a little tricky.</p>
<p><strong>Direct writes</strong></p>
<p>During a direct mode operation, blocks are pre-formatted and written directly to disk, bypassing the buffer cache. If the database is altered to FORCE LOGGING mode, direct mode operations still write the blocks directly. In addition, they generate redo for those blocks, almost as if the whole block were written into the redo log files. This increases the redo size.</p>
<p>A few statistics capture direct mode writes. Using these statistics, we can estimate the redo size for direct mode operations.</p>
<p><strong>Statistics</strong></p>
<p>The statistic &#8216;physical writes direct&#8217; is made up mostly of three component statistics, as given below.</p>
<pre>Physical writes direct = &lt; writes to data file due to direct mode operations&gt; +
                              physical writes direct to temporary tablespace +
                              physical writes direct (LOB)</pre>
<p>To identify the size of direct writes to data files, excluding temp files, the formula is simply:</p>
<pre>Physical writes to datafile = block_size * ( physical writes direct -
                                                physical writes direct to temporary tablespace )</pre>
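<p>Expressed in code, the estimate is a subtraction scaled by the block size. The Python sketch below is illustrative only and is not the awr_redo_nologging_size.sql script. The function name and the sample statistic values are hypothetical, and, like the script, it assumes a single uniform block size.</p>

```python
def direct_datafile_write_bytes(phys_writes_direct, phys_writes_direct_temp,
                                block_size=8192):
    """Hypothetical helper: bytes written directly to data files (not temp).

    Under FORCE LOGGING, roughly this much extra redo would be generated,
    because the directly written blocks must also be logged.
    Both statistics are block counts; block_size is in bytes.
    """
    if phys_writes_direct_temp > phys_writes_direct:
        raise ValueError("temp direct writes cannot exceed total direct writes")
    return block_size * (phys_writes_direct - phys_writes_direct_temp)


# Hypothetical statistic deltas between two AWR snapshots, in blocks.
est = direct_datafile_write_bytes(1_000_000, 750_000)
print(f"{est / 1024 / 1024:,.1f} MB")  # 250,000 blocks of 8 KB each
```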
<p>The script <a href="http://www.orainternals.com/wp-content/uploads/2011/12/awr_redo_nologging_size_sql.txt">awr_redo_nologging_size.sql</a> uses this formula to estimate the redo size if the database is altered to FORCE LOGGING mode. One caution: the script assumes a uniform block size (whatever you specify; 8192 is the default). If you use multiple block sizes in your database, specify the biggest block size in use (or the average!). With the biggest block size, the script will overestimate, but that is better than underestimating.</p>
<pre>awr_redo_nologging_size.sql v1.00 by Riyaj Shamsudeen @orainternals.com

To generate Report about Redo rate from AWR tables

Enter the block size(Null=8192):
Enter past number of days to search for (Null=30):21

DB_NAME   REDO_DATE                redo_size (MB) phy_writes_dir (MB) phy_writes_dir_temp(MB)
--------- ------------------- ------------------- ------------------- -----------------------
...
TEST1      01/09/2012 00:00:00          554,967.92        4,337,470.54            4,048,463.09
TEST1      01/10/2012 00:00:00          725,161.69        7,631,308.52            7,311,254.35
TEST1      01/11/2012 00:00:00        1,417,910.43       11,022,558.04           10,424,339.66
TEST1      01/12/2012 00:00:00          162,109.27        2,756,108.79            2,658,140.35
TEST1      01/13/2012 00:00:00          736,137.74        5,449,356.39            5,107,896.82
TEST1      01/14/2012 00:00:00          880,102.10        3,494,355.88            3,119,470.18
...</pre>
<p>In the output above, notice the line for 01/11/2012. The <em>estimated total</em> redo size is ~1,417 GB if we alter the database to FORCE LOGGING mode at the database level. Of that 1,417 GB, ~600 GB of redo would be generated by direct mode operations. There were 11,022 GB of direct writes, and subtracting 10,424 GB of direct writes to the temporary tablespace (over 10 TB of writes to temporary tablespace!) leaves about 598 GB.</p>
<p><strong>Example #2</strong><br />
In this example, notice 28-DEC-11. Roughly 62 GB (62,971 MB) of redo is estimated if we alter the database to FORCE LOGGING mode. Of that, only about 550 MB (4,650.37 MB - 4,096.75 MB = 553.62 MB) would be generated by direct mode operations.</p>
<pre>DB_NAME   REDO_DATE            redo_size (MB) phy_writes_dir (MB) phy_writes_dir_temp(MB)
--------- --------------- ------------------- ------------------- -----------------------
...
TEST2     24-DEC-11                 19,149.11            2,796.57                2,361.68
TEST2     25-DEC-11                 18,362.74            1,630.83                1,379.95
TEST2     26-DEC-11                 60,097.50            3,867.92                3,303.37
TEST2     27-DEC-11                 55,696.98            3,266.89                2,756.84
TEST2     28-DEC-11                 62,971.37            4,650.37                4,096.75
TEST2     29-DEC-11                 62,167.32            3,839.07                3,255.70
TEST2     30-DEC-11                 64,072.57            4,462.38                3,788.39
...</pre>
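<p>The per-day arithmetic used in both examples can be reproduced from the report columns. This Python sketch subtracts the temporary-tablespace direct writes from the total direct writes. Both values are already in MB in the report output, and the sample rows below are copied from the two outputs above. The helper name is hypothetical.</p>

```python
def direct_write_redo_mb(phy_writes_dir_mb, phy_writes_dir_temp_mb):
    """Hypothetical helper: MB of direct writes to data files (not temp),
    i.e. the approximate extra redo under FORCE LOGGING."""
    return phy_writes_dir_mb - phy_writes_dir_temp_mb


# (label, redo_size_mb, phy_writes_dir_mb, phy_writes_dir_temp_mb) from the outputs above
samples = [
    ("TEST1 01/11/2012", 1_417_910.43, 11_022_558.04, 10_424_339.66),
    ("TEST2 28-DEC-11", 62_971.37, 4_650.37, 4_096.75),
]

for label, redo_mb, dir_mb, dir_temp_mb in samples:
    direct_mb = direct_write_redo_mb(dir_mb, dir_temp_mb)
    print(f"{label}: ~{direct_mb:,.0f} MB of {redo_mb:,.0f} MB "
          f"({direct_mb / redo_mb:.1%}) from direct mode operations")
```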
<p><strong>Summary</strong><br />
In summary, we can estimate the redo size that would be generated if we alter the database to FORCE LOGGING mode. This estimate is very useful when implementing these replication tools.</p>
<p>Thanks to Kirti Deshpande and Kalyan Maddali for testing out my script. Of course, any mistake in the script is mine, only mine.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.orainternals.com/2012/01/26/nologging-redo-size/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Troubleshooting LMS processes</title>
		<link>http://www.orainternals.com/2012/01/20/troubleshooting-lms-processes/</link>
		<comments>http://www.orainternals.com/2012/01/20/troubleshooting-lms-processes/#comments</comments>
		<pubDate>Fri, 20 Jan 2012 15:04:55 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[11g]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[Oracle database internals]]></category>
		<category><![CDATA[Performance tuning]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[RAC]]></category>
		<category><![CDATA[video]]></category>
		<category><![CDATA[LMS tuning]]></category>
		<category><![CDATA[oracle performance]]></category>
		<category><![CDATA[RAC performance]]></category>
		<category><![CDATA[video RAC training]]></category>

		<guid isPermaLink="false">http://www.orainternals.com/?p=1305</guid>
		<description><![CDATA[This was created circa July 2011. Enjoy.]]></description>
			<content:encoded><![CDATA[<p>This was created circa July 2011. Enjoy.</p>
<div id="v-mabMGh2o-1" class="video-player"><embed id="v-mabMGh2o-1-video" src="http://s0.videopress.com/player.swf?v=1.03&amp;guid=mabMGh2o&amp;isDynamicSeeking=true" type="application/x-shockwave-flash" width="400" height="250" wmode="direct" seamlesstabbing="true" allowfullscreen="true" allowscriptaccess="always" overstretch="true"></embed></div>
]]></content:encoded>
			<wfw:commentRss>http://www.orainternals.com/2012/01/20/troubleshooting-lms-processes/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SCN &#8211; What, why, and how?</title>
		<link>http://www.orainternals.com/2012/01/20/scn-what-why-and-how/</link>
		<comments>http://www.orainternals.com/2012/01/20/scn-what-why-and-how/#comments</comments>
		<pubDate>Thu, 19 Jan 2012 20:43:55 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[11g]]></category>
		<category><![CDATA[Blog]]></category>
		<category><![CDATA[corruption]]></category>
		<category><![CDATA[Oracle database internals]]></category>
		<category><![CDATA[Performance tuning]]></category>
		<category><![CDATA[RAC]]></category>
		<category><![CDATA[recovery]]></category>
		<category><![CDATA[12371955]]></category>
		<category><![CDATA[get_system_change_number]]></category>
		<category><![CDATA[hot backup bug]]></category>
		<category><![CDATA[kgcmgas calls]]></category>
		<category><![CDATA[ORA-00600 [2252]]]></category>
		<category><![CDATA[SCN]]></category>
		<category><![CDATA[SCN bug]]></category>

		<guid isPermaLink="false">http://www.orainternals.com/?p=1268</guid>
		<description><![CDATA[In this blog entry, we will explore the wonderful world of SCNs and how Oracle database uses SCN internally. We will also explore few new bugs and clarify few misconceptions about SCN itself. What is SCN? SCN (System Change Number) is a primary mechanism to maintain data consistency in Oracle database. SCN is used primarily [...]]]></description>
			<content:encoded><![CDATA[<p>In this blog entry, we will explore the wonderful world of SCNs and how Oracle database uses SCN internally. We will also explore few new bugs and clarify few misconceptions about SCN itself.</p>
<p><strong>What is SCN?</strong></p>
<p>The SCN (System Change Number) is the primary mechanism for maintaining data consistency in an Oracle database. SCNs are used primarily in the following areas (this is, of course, not a complete list):</p>
<ol>
<li>Every redo record carries an SCN, the version of the redo record, in its redo header (and the SCNs of redo records are not necessarily unique). Given redo records from two threads (as in RAC), recovery orders them by SCN, essentially maintaining a strict sequential order. As explained in my <a href="http://www.orainternals.com/wp-content/uploads/2011/12/Riyaj_redo_internals_and_tuning_by_redo_reduction_doc1.pdf">paper</a>, every redo record can also have multiple change vectors.</li>
<li>Every data block also has a block SCN (aka the block version). In addition, a change vector in a redo record carries the expected block SCN, which means that a change vector can be applied to one and only one version of the block. Before applying the redo record, the code checks whether the target SCN in the change vector matches the block SCN. If there is a mismatch, corruption errors are thrown.</li>
<li>Read consistency also uses SCNs. Every query has a query environment that includes the SCN at the start of the query. A session can see a transaction&#8217;s changes only if that transaction&#8217;s commit SCN is lower than the query environment SCN.</li>
<li>Commit. Every commit generates an SCN, aka the commit SCN, that marks a transaction boundary. Group commits are possible too.</li>
</ol>
<p><strong>SCN format</strong></p>
<p>An SCN is a huge number with two components: base and wrap. Wrap is a 16-bit number and base is a 32-bit number, and an SCN is written in the format wrap.base. When the base exceeds 4 billion (precisely 2^32 = 4,294,967,296), the wrap is incremented by 1. Essentially, wrap counts the number of times the base has wrapped around 4 billion. A few simple SQL statements illustrate this better.</p>
<p>In the SQL statement below, we use the dbms_flashback package to get the current system change number, and we also convert that number to hex format to break down the SCN.</p>
<pre>col curscn format 99999999999999999999999
select to_char(dbms_flashback.get_system_change_number,'xxxxxxxxxxxxxxxxxxxxxx'),
dbms_flashback.get_system_change_number curscn from dual;
TO_CHAR(DBMS_FLASHBACK. CURSCN
----------------------- ------------------------
280000371 10737419121</pre>
<p>Here, the hex value of the SCN is 0x280000371 and the decimal value is 10737419121. The hex value 0x280000371 can be split into two components, better written as 0x2.80000371, where 0x2 is the wrap and 0x80000371 is the hex representation of the base. To verify the base and wrap, we can put them back together to get the SCN value: multiply the wrap by 4 * power(2,30) (about 4 billion) and add the base. The output below shows that the two numbers match.</p>
<pre>col n2 format  99999999999999999999999
select to_number('2','xxxxxxx') * 4 * power(2,30) + to_number('80000371','xxxxxxxxxxxxxxxxxxxxxx') n2 from dual
/
N2
-------------------
10737419121</pre>
<p>Following this logic, the maximum value of the wrap defines the maximum value of the SCN. The number of possible SCN values is 65536 * 4 * power(2,30) = 281,474,976,710,656, or about 281 trillion.</p>
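<p>The same wrap/base arithmetic in Python serves as a quick cross-check of the SQL above. Because "4 billion" here is exactly 2^32, the split is just a 32-bit shift and mask. The helper names split_scn and join_scn are hypothetical.</p>

```python
BASE_BITS = 32   # base is a 32-bit number
WRAP_BITS = 16   # wrap is a 16-bit number


def split_scn(scn):
    """Hypothetical helper: return (wrap, base), i.e. the wrap.base form."""
    return scn >> BASE_BITS, scn & ((1 << BASE_BITS) - 1)


def join_scn(wrap, base):
    """Hypothetical helper: rebuild the SCN from its wrap and base."""
    return (wrap << BASE_BITS) | base


wrap, base = split_scn(10737419121)
print(f"0x{wrap:x}.{base:08x}")             # 0x2.80000371
print(join_scn(2, 0x80000371))              # 10737419121
print(f"{1 << (WRAP_BITS + BASE_BITS):,}")  # 281,474,976,710,656
```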
<p><strong>Does each change increment SCN?</strong></p>
<p>Not necessarily. The SCN is not incremented for every change. For example, in the script below we change the table 1000 times, yet only a few SCNs are generated.</p>
<pre>create table  rs.dropme (n1 number , n2 number);
test_case_scn.sql:
--------------cut --------------
col curscn format 99999999999999999999999
select dbms_flashback.get_system_change_number curscn from dual;
begin
 for i in 1 .. 1000
 loop
 insert into rs.dropme values(i, i);
 end loop;
end;
/
select dbms_flashback.get_system_change_number curscn from dual;
------------cut -----------------
alter system switch logfile;
SQL&gt; @test_case_scn
  CURSCN
------------------------
10737428262
PL/SQL procedure successfully completed.

CURSCN
------------------------
10737428271
SQL&gt; alter system switch logfile;
System altered.</pre>
<p>Even though there were 1000 changes to the table, the SCN increased by just 9 (10737428271 - 10737428262). If we dump the redo records using the script <a href="http://www.orainternals.com/wp-content/uploads/2012/01/dump_last_log_sql.txt">dump_last_log.sql</a>, we can see that redo records carry both an SCN and a SUBSCN, as shown below. Many redo records share the same SCN and SUBSCN combination.</p>
<pre>REDO RECORD - Thread:1 RBA: 0x000010.0000001c.018c LEN: 0x00fc VLD: 0x01
SCN: 0x0002.8000fb87 SUBSCN:  1 01/19/2012 09:14:27
REDO RECORD - Thread:1 RBA: 0x000010.0000001d.0098 LEN: 0x00fc VLD: 0x01
SCN: 0x0002.8000fb87 SUBSCN:  1 01/19/2012 09:14:27
REDO RECORD - Thread:1 RBA: 0x000010.0000001d.0194 LEN: 0x00fc VLD: 0x01
SCN: 0x0002.8000fb87 SUBSCN:  1 01/19/2012 09:14:27
REDO RECORD - Thread:1 RBA: 0x000010.0000001e.00a0 LEN: 0x00fc VLD: 0x01
SCN: 0x0002.8000fb87 SUBSCN:  1 01/19/2012 09:14:27</pre>
<pre>...</pre>
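<p>To see how many redo records share an SCN, you can group the SCN lines of such a dump. The Python sketch below is a hypothetical helper and not part of dump_last_log.sql. It converts the 0xwrap.base hex form into a decimal SCN and counts records per (SCN, SUBSCN) pair, using the four records shown above as input.</p>

```python
import re
from collections import Counter

# "SCN: 0x0002.8000fb87 SUBSCN:  1 01/19/2012 09:14:27"
SCN_RE = re.compile(r"SCN: 0x([0-9a-fA-F]+)\.([0-9a-fA-F]+) SUBSCN:\s+(\d+)")


def count_scn_subscn(dump_lines):
    """Hypothetical helper: count redo records per (decimal SCN, SUBSCN)."""
    counts = Counter()
    for line in dump_lines:
        m = SCN_RE.search(line)
        if m:
            wrap, base = int(m.group(1), 16), int(m.group(2), 16)
            counts[((wrap << 32) | base, int(m.group(3)))] += 1
    return counts


dump = ["SCN: 0x0002.8000fb87 SUBSCN:  1 01/19/2012 09:14:27"] * 4
print(count_scn_subscn(dump))
```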
<p><strong>Database links and SCNs</strong></p>
<p>Database-link-based transactions can cause SCN increases too. For example, say that three databases, db1, db2, and db3, participate in a distributed transaction, and their current SCNs are 1000, 2000, and 5000 respectively. At commit time, a coordinated SCN is needed for the distributed transaction, and the maximum SCN value among all participating databases is chosen. As a result, the SCN of all three databases is increased to 5000.</p>
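<p>A deliberately simplified Python model of that coordination follows. It ignores the actual two-phase commit protocol and only models the SCN arithmetic: at commit, every participant&#8217;s SCN is raised to the highest SCN among them. The function name is hypothetical.</p>

```python
def coordinate_commit_scn(participant_scns):
    """Hypothetical, simplified model of SCN coordination at commit.

    The highest SCN among the participating databases is chosen and every
    participant is advanced to it. (The real protocol is two-phase commit;
    this only models the SCN arithmetic.)
    """
    commit_scn = max(participant_scns.values())
    return commit_scn, {db: commit_scn for db in participant_scns}


commit_scn, after = coordinate_commit_scn({"db1": 1000, "db2": 2000, "db3": 5000})
print(commit_scn, after)
```

<p>In this model, as in the description above, a single database with a high SCN pulls up the SCN of every database it transacts with.</p>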
<p><strong>Can you run out of SCN?</strong></p>
<p>As you saw earlier, the hard limit for the SCN is 281 trillion. In addition, the Oracle code imposes a soft limit as a protection mechanism. If the next SCN would be greater than the soft limit, ORA-600 [2252] is raised and the operation is cancelled. For example, in a database-link-based distributed transaction, ORA-600 [2252] is raised if the coordinated SCN is greater than the soft limit.</p>
<p>This soft limit is calculated using the formula (number of seconds since 1/1/1988) * 16384. Because the number of seconds since 1/1/1988 keeps increasing, the soft limit continuously increases at a rate of 16K per second. Unless your database is running full steam, generating more than 16K SCNs per second, you won&#8217;t run into the soft limit easily. (You could, however, provoke ORA-600 [2252] by resetting your server clock to 1/1/1988.)</p>
<p>The problem comes when many interconnected databases each generate SCNs at a higher rate in a round-robin fashion. DB1 generates 20K SCNs per second for the first 5 minutes, DB2 generates 20K SCNs per second for the next 5 minutes, DB3 generates 20K SCNs per second for the next 5 minutes, and so on. Because SCNs propagate over the database links, all three databases end up with a sustained rate of 20K SCNs per second. The databases slowly catch up to the soft limit (roughly 1 second for every 4 seconds), and again, it would take many years for them to reach it, assuming the databases are continuously active. But then there is that infamous hot backup bug, hated by my client.</p>
<p>(By the way, at the normal 16K rate, it would take about 544 years to reach the hard limit and run out of SCNs: 65536*4*1024*1024*1024 / 16384 / 60/60/24/365 works out to about 544.8.)</p>
<p>Here is an example of the ORA-600 [2252] error. In the lines printed below, 2838 is the SCN wrap and 395527372 is the SCN base. Converted to decimal, the SCN is 2838 * 4 * power(2,30) + 395527372 = 12,189,512,713,420, in the 12 trillion range. A database-link-based connection was trying to increase the SCN beyond 12 trillion, but the database rejected it because the SCN exceeded the soft limit.</p>
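<p>The soft-limit formula and the hard-limit estimate can be checked with a few lines of Python. This is a sketch of the arithmetic described above, not Oracle&#8217;s internal code. The rate defaults to the 16K per second described here (in 11gR2 the rate is a parameter), the timestamp is treated as naive local server time, and the helper names are hypothetical.</p>

```python
from datetime import datetime

EPOCH_1988 = datetime(1988, 1, 1)
HARD_LIMIT = 1 << 48  # 65536 * 4 * 2**30 possible SCN values


def scn_soft_limit(now, rate_per_sec=16384):
    """Hypothetical helper: seconds elapsed since 1/1/1988 * allowed SCN rate."""
    return int((now - EPOCH_1988).total_seconds()) * rate_per_sec


def years_to_hard_limit(rate_per_sec=16384):
    """Hypothetical helper: years to exhaust all SCN values at a constant rate."""
    return HARD_LIMIT / rate_per_sec / (60 * 60 * 24 * 365)


def headroom_seconds(current_scn, now, rate_per_sec=16384):
    """Hypothetical helper: seconds' worth of soft-limit growth between the
    current SCN and the soft limit (negative means past the limit)."""
    return (scn_soft_limit(now, rate_per_sec) - current_scn) / rate_per_sec


print(f"{years_to_hard_limit():.1f} years at 16K SCNs per second")  # 544.8
```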
<pre> 
ORA-00600: internal error code, arguments: [2252], [2838], [395527372], [], [], [], [], [], [], [], [], []</pre>
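<p>Converting the two ORA-600 [2252] arguments back to a decimal SCN uses the same wrap/base arithmetic. Here is a short Python sketch that assumes, as described above, that the first argument is the wrap and the second is the base. The helper name is hypothetical.</p>

```python
def scn_from_ora600_2252(wrap, base):
    """Hypothetical helper: combine the wrap and base arguments into one SCN."""
    return wrap * 2**32 + base


scn = scn_from_ora600_2252(2838, 395527372)
print(f"{scn:,}")  # 12,189,512,713,420 (about 12.19 trillion)
```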
<p>By the way, in 10g this 16K-per-second rate was hard coded. In 11gR2, the rate is controlled by the underscore parameter _max_reasonable_scn_rate, which defaults to 32K.</p>
<p><strong>Hot backup bug</strong></p>
<p>Most DBAs use RMAN for backups, but a few databases still use hot backup mode, primarily because of disk-mirror-based backups. It is common to see a higher SCN rate while the database is in hot backup mode. An SGA variable array keeps track of the backup mode at the file level. When you take the database out of backup mode, these SGA variables are reset and the SCN rate goes back to normal. Due to bug 12371955, that SGA variable is not reset, which leaves the database thinking it is still in hot backup mode, so it keeps generating SCNs at the higher rate. (If you restart the database later, the variable is, of course, reset and the rate returns to normal.) There is a way to dump the SGA variable to check whether the database currently thinks it is in hot backup mode.</p>
<p>Due to this bug, a highly active database can sustain an SCN rate above 16K per second. Over a long period of time (in fact, probably many years), the SCN catches up to the soft limit, and once the soft limit is reached, the next SCN update throws ORA-600 [2252] errors. Of course, this SCN growth is also propagated to other databases over database links. Because the soft limit calculation is time based, the time zone of the server also matters. For example, if the SCN values are close to the soft limit, databases running in the US Eastern time zone have a soft limit higher by 3*60*60*16384 = 176,947,200 (about 177 million) than databases running in the Pacific time zone, which is 3 hours behind.</p>
<p>Salient points of the bug are:</p>
<ol>
<li>There is <span style="text-decoration: underline;">no corruption</span> danger, but sessions might die or the databases might throw ORA-600 errors. In rare cases, databases have had to be kept down for a few hours, or distributed transactions removed from the database, so that the headroom between the soft limit and the current SCN widens.</li>
<li>This bug applies only if you use the &#8216;ALTER DATABASE&#8217; command. If you use the &#8216;ALTER TABLESPACE&#8217; command for backup, you are not affected by this bug.</li>
<li>The SCN rate is also directly related to activity. If the database has low activity, the SCN rate is also low, even when the database is in backup mode with this bug.</li>
</ol>
<p>Oracle has released a script that tells you how close your database is to the soft limit, aka the SCN headroom. The script is available as bug 13498243 and reports how many days of SCN headroom you have, so first use it to check whether your database has any SCN issue.</p>
<p><strong>How to check SCN rate?</strong></p>
<p>There are multiple ways to check SCN rate in your database.</p>
<p><span style="text-decoration: underline;">Method 1:</span></p>
<p>smon_scn_time keeps track of the mapping between time and SCN at approximately 5-minute granularity, and it can be used to measure the SCN rate, as in the query below. Although this is easy to check, remember that there is no easy way to tell whether an SCN increase is due to intrinsic activity in the database or to an external database increasing the SCN through distributed transactions. We will discuss this differentiation later.</p>
<pre>with t1 as(
select time_dp , 24*60*60*(time_dp - lag(time_dp) over (order by time_dp)) timediff,
  scn - lag(scn) over(order by time_dp) scndiff
from smon_scn_time
)
select time_dp , timediff, scndiff,
       trunc(scndiff/timediff) rate_per_sec
from t1
order by 1
/
TIME_DP                TIMEDIFF    SCNDIFF RATE_PER_SEC
-------------------- ---------- ---------- ------------
19-JAN-2012 15:23:21        315       2931            9
19-JAN-2012 15:25:46        145        708            4
19-JAN-2012 15:28:00        134       1268            9
19-JAN-2012 15:30:48        168        597            3
19-JAN-2012 15:35:51        303       4148           13
19-JAN-2012 15:36:47         56        103            1
19-JAN-2012 15:42:14        327        671            2</pre>
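<p>The lag() arithmetic in the query above can be reproduced outside the database, for example on (time_dp, scn) rows exported from smon_scn_time. A minimal sketch; the function name is hypothetical and the starting SCN in the example (1000) is made up, with only the deltas taken from the first two rows of the output.</p>

```python
from datetime import datetime, timedelta


def scn_rates(samples):
    """Per-interval SCN rate from (time, scn) pairs, like the Method 1 query.

    samples: list of (datetime, scn) tuples, e.g. (time_dp, scn) rows read
    from smon_scn_time. Returns (time, timediff_sec, scndiff, rate_per_sec)
    for every sample after the first; the SQL shows the first row with NULLs.
    """
    ordered = sorted(samples)
    rows = []
    for (prev_t, prev_scn), (t, scn) in zip(ordered, ordered[1:]):
        timediff = (t - prev_t).total_seconds()
        scndiff = scn - prev_scn
        # int() truncates toward zero, like trunc() in the SQL
        rate = int(scndiff / timediff) if timediff else None
        rows.append((t, timediff, scndiff, rate))
    return rows


# Hypothetical starting SCN (1000); the deltas match the first two rows above.
t1 = datetime(2012, 1, 19, 15, 23, 21)
samples = [
    (t1 - timedelta(seconds=315), 1000),
    (t1, 1000 + 2931),
    (t1 + timedelta(seconds=145), 1000 + 2931 + 708),
]
for row in scn_rates(samples):
    print(row[1], row[2], row[3])
```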
<p><span style="text-decoration: underline;"><em>Method 2:</em></span></p>
<p>v$archived_log can also be used to check the SCN rate of the database. The code below computes the SCN rate per second from v$archived_log. Even if you are running RAC, a query against v$archived_log is sufficient, as it holds the archived logs from all threads. If there is an SCN spike, say from a remote database, you will see it in the output of this query.</p>
<pre>alter session set nls_date_format='DD-MON-YYYY HH24:MI:SS';
col first_change# format 99999999999999999999
col next_change# format 99999999999999999999
select  thread#,  first_time, next_time, first_change# ,next_change#, sequence#,
   next_change#-first_change# diff,
   -- nullif() avoids a divide-by-zero error when first_time = next_time
   round ((next_change#-first_change#)/nullif(next_time-first_time, 0)/24/60/60) rt
from (
select thread#, first_time, first_change#,next_time,  next_change#, sequence#,dest_id from v$archived_log
where next_time &gt; sysdate-30 and dest_id=1
order by next_time
)
order by  first_time, thread#
/

   THREAD# FIRST_TIME                   FIRST_CHANGE#          NEXT_CHANGE#  SEQUENCE#       DIFF         RT
---------- -------------------- --------------------- --------------------- ---------- ---------- ----------
         2 12-JAN-2012 16:10:30              25995867              26026647        308      30780          0
         1 17-JAN-2012 14:05:00              26026649              26028427        555       1778          1
         1 17-JAN-2012 14:05:00              26026649              26028427        555       1778          1
         2 17-JAN-2012 14:05:00              26026647              26028432        309       1785          1
         2 17-JAN-2012 14:05:00              26026647              26028432        309       1785          1
         1 17-JAN-2012 14:27:21              26028427            1073743815        556 1047715388     814076
         2 17-JAN-2012 14:48:48              26028157              26028230          1         73          3
         2 18-JAN-2012 14:22:23              26076103           10737418303          3 1.0711E+10    7448778
         1 18-JAN-2012 14:22:24              26076106           10737427850          5 1.0711E+10    1458319
         1 18-JAN-2012 16:24:49           10737427850           10737427884          6         34          2
         1 18-JAN-2012 16:25:03           10737427884           10737428252          7        368          1</pre>
<p>In the output above, there was an SCN jump of roughly 10 billion in the archived logs that start at 14:22 on 18-JAN-2012 (and an earlier jump of about 1 billion in the thread 1 log that starts at 14:27 on 17-JAN-2012). You can&#8217;t easily tell whether such an increase came from external systems or from intrinsic activity. In this specific case, because the SCN increase is so extreme, I would guess that it came from external systems. (Usually this level of SCN increase will not happen in your production site; my example is just to explain the concept.)</p>
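<p>The rate alone cannot say whether growth is intrinsic or extrinsic, but it can highlight which archived logs deserve a closer look. A minimal sketch, assuming you have exported sequence#, first_change#, next_change# and the elapsed seconds of each log; the 16384/s threshold is the assumed soft-limit growth rate, and the log entries in the example are hypothetical.</p>

```python
SCN_PER_SECOND = 16384  # assumed soft-limit growth per second


def flag_spikes(logs, threshold=SCN_PER_SECOND):
    """Return (sequence, rate) for archived logs whose SCN rate exceeds threshold.

    logs: iterable of (sequence, first_change, next_change, seconds), where
    seconds is next_time - first_time of that archived log (in seconds).
    A sustained rate above 16384/s consumes soft-limit headroom.
    """
    flagged = []
    for seq, first_change, next_change, seconds in logs:
        if seconds == 0:
            continue  # first_time equals next_time: rate undefined
        rate = round((next_change - first_change) / seconds)
        if rate > threshold:
            flagged.append((seq, rate))
    return flagged


# Hypothetical log entries: one quiet log, one with a jump of 16,384,000 SCNs in 100 s.
logs = [(1, 0, 1000, 100), (2, 1000, 1000 + 16_384_000, 100)]
print(flag_spikes(logs))  # [(2, 163840)]
```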
<p><strong>What happens in RAC?</strong></p>
<p>In RAC, the instance that receives the update from an external system increases the database SCN to the new, higher SCN. When the other instances next request an SCN, that increase is immediately propagated to them too.</p>
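<p>The propagation rule above can be pictured with a toy model. This is a simplified illustration, not Oracle's actual SCN synchronization protocol: each instance keeps a local SCN, an incoming SCN only ever moves it forward, and an instance allocating a new SCN first catches up with the highest SCN known in the cluster. The class and instance names are hypothetical.</p>

```python
class ToyRacCluster:
    """Toy model of SCN propagation in RAC (not Oracle's real protocol)."""

    def __init__(self, instances, start_scn):
        self.local = {name: start_scn for name in instances}

    def receive_remote_scn(self, instance, remote_scn):
        # SCNs never go backwards: keep the higher of the two values.
        self.local[instance] = max(self.local[instance], remote_scn)

    def next_scn(self, instance):
        # Catch up with the highest SCN known anywhere in the cluster,
        # so a jump received by one instance reaches the others.
        highest = max(self.local.values())
        self.local[instance] = highest + 1
        return self.local[instance]


cluster = ToyRacCluster(["inst1", "inst2"], start_scn=26_000_000)
cluster.receive_remote_scn("inst1", 10_737_418_240)  # jump pushed in over a db link
print(cluster.next_scn("inst2"))  # 10737418241
```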
<p><strong>Can two threads get same SCN?</strong></p>
<p>The obvious answer is no. The correct answer is yes. For example, the redo records below from two threads show exactly the same SCN and subSCN. This is not a problem or a concern, as buffer changes are protected by the GCS layer code and row changes are protected by the locking mechanism.</p>
<pre><span style="text-decoration: underline;">node 1</span>:

REDO RECORD - Thread:1 RBA: 0x000010.0000007f.0114 LEN: 0x0138 VLD: 0x01
SCN: 0x0002.8000fb91 SUBSCN:  1 01/19/2012 09:14:27

<span style="text-decoration: underline;">node 2:</span>
REDO RECORD - Thread:2 RBA: 0x000007.00000003.0010 LEN: 0x0068 VLD: 0x05
SCN: 0x0002.8000fb91 SUBSCN:  1 01/19/2012 09:14:27</pre>
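<p>Both redo records carry SCN 0x0002.8000fb91. To compare such dump values with the decimal SCNs shown elsewhere in this post, combine the wrap (before the dot) and the base (after the dot) as wrap * 2^32 + base. A minimal sketch; the function name is hypothetical. The result falls in the same 10.7 billion range as the SCNs in the v$archived_log output above.</p>

```python
def scn_to_number(scn_text: str) -> int:
    """Convert a redo-dump SCN such as '0x0002.8000fb91' into a decimal SCN.

    The part before the dot is the wrap, the part after it is the base:
    SCN = wrap * 2**32 + base.
    """
    wrap_hex, base_hex = scn_text.split(".")
    wrap = int(wrap_hex, 16)  # int() with base 16 accepts the 0x prefix
    base = int(base_hex, 16)
    return wrap * 2**32 + base


print(scn_to_number("0x0002.8000fb91"))  # 10737482641
```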
<p><strong>Intrinsic vs Extrinsic SCN growth</strong></p>
<p>There is a statistic that can also guide us in determining whether the SCN increase is intrinsic or extrinsic. The statistic &#8216;calls to kcmgas&#8217; gives an approximate number of calls to allocate SCNs. This statistic is an estimate only, not an absolute count of generated SCNs. Let&#8217;s understand this statistic with a script and a helper function.</p>
<p><code> create or replace function get_my_statistics (l_stat_name varchar2)<br />
return number as<br />
l_value number;<br />
begin<br />
select ses.value into l_value<br />
from v$sesstat ses , v$statname stat<br />
where stat.statistic#=ses.statistic# and<br />
ses.sid=(select sid from v$mystat where rownum = 1) and stat.name = l_stat_name;<br />
return l_value;<br />
end;<br />
/<br />
alter system switch logfile;<br />
host sleep 5<br />
create table rs.dropme (n1 number , n2 number);<br />
col curscn format 99999999999999999999999<br />
select dbms_flashback.get_system_change_number curscn , get_my_statistics('calls to kcmgas') kcmgas from dual;<br />
begin<br />
for i in 1 .. 100000<br />
loop<br />
insert into rs.dropme values(i, i);<br />
end loop;<br />
end;<br />
/<br />
select dbms_flashback.get_system_change_number curscn , get_my_statistics('calls to kcmgas') kcmgas from dual;<br />
alter system switch logfile;<br />
</code></p>
<p>Output of the above script is:</p>
<pre>                  CURSCN     KCMGAS
------------------------ ----------
             10737522265          0
 PL/SQL procedure successfully completed.
                  CURSCN     KCMGAS
------------------------ ----------
             10737523122        826</pre>
<p>From the output, we can see that the SCN advanced by 857 while this session made 826 kcmgas calls. Other background processes could also be generating SCNs, which would explain the difference. Even at the instance level the numbers don&#8217;t match exactly, but multiplying the &#8216;calls to kcmgas&#8217; statistic by 1.1 gives a better estimate. This method can be used to identify whether the SCN growth in a database is intrinsic or extrinsic. It can also be used to identify the instance generating more SCNs in a RAC cluster, or the database generating more SCNs in a complex, interconnected environment.</p>
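<p>The comparison above can be turned into a small check. A minimal sketch, assuming you sampled the SCN (for example with dbms_flashback.get_system_change_number) and the &#8216;calls to kcmgas&#8217; statistic at two points in time, and applying the 1.1 rule of thumb from the text. Reporting the "unexplained" remainder is an illustrative idea, not an Oracle-defined metric, and the function name is hypothetical.</p>

```python
KCMGAS_FACTOR = 1.1  # rule-of-thumb multiplier quoted in the text above


def unexplained_scn_growth(scn_before, scn_after, kcmgas_before, kcmgas_after,
                           factor=KCMGAS_FACTOR):
    """SCN growth not accounted for by local 'calls to kcmgas'.

    A result near zero suggests intrinsic growth; a large positive result
    suggests extrinsic growth, e.g. SCNs pushed in by another database.
    """
    observed = scn_after - scn_before
    estimated = round((kcmgas_after - kcmgas_before) * factor)
    return max(0, observed - estimated)


# Session-level numbers from the output above: SCN +857, kcmgas +826.
print(unexplained_scn_growth(10737522265, 10737523122, 0, 826))  # 0
```

<p>For an instance-level view, the same check would use instance-wide deltas of the statistic instead of session-level ones.</p>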
<p><strong>SCN Vulnerability issue</strong></p>
<p>I am not going to discuss the details of this vulnerability at all. However, exploiting it requires access to the production database, and security-minded DBAs don&#8217;t grant production access that easily anyway. So, in my opinion, it is a problem that must be addressed, but you would need a malicious DBA with expert-level knowledge to misuse this vulnerability. Follow Oracle Support&#8217;s direction on this one, as I usually stay away from discussing security vulnerabilities. Check here for <a href="http://www.oracle.com/technetwork/topics/security/cpujan2012-366304.html">details</a>.</p>
<p><strong>Summary</strong></p>
<p>I have been holding off on publishing this blog entry for many months. Now that this issue is public knowledge, I can share what I know without any repercussions. In a nutshell, understanding SCN generation and its internal details is important. Armed with these scripts, you can review your own environment.</p>
<p>Update 1: Correcting some formatting issues, sorry</p>
<p>Update 2: Corrected the wording to read &#8220;Essentially, multiply wrap by 4 billion (2^32) and add base to get the SCN in number format&#8221;</p>
]]></content:encoded>
			<wfw:commentRss>http://www.orainternals.com/2012/01/20/scn-what-why-and-how/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>GC cr disk read</title>
		<link>http://www.orainternals.com/2012/01/13/gc-cr-disk-read/</link>
		<comments>http://www.orainternals.com/2012/01/13/gc-cr-disk-read/#comments</comments>
		<pubDate>Fri, 13 Jan 2012 01:05:25 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Oracle database internals]]></category>
		<category><![CDATA[Performance tuning]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[RAC]]></category>
		<category><![CDATA[down convert]]></category>
		<category><![CDATA[gc cr disk read]]></category>
		<category><![CDATA[light works]]></category>
		<category><![CDATA[oracle performance]]></category>
		<category><![CDATA[RAC performance]]></category>
		<category><![CDATA[RAC performance myths]]></category>

		<guid isPermaLink="false">http://www.orainternals.com/?p=1222</guid>
		<description><![CDATA[You might encounter RAC wait event &#8216;gc cr disk read&#8217; in 11.2 while tuning your applications in RAC environment. Let&#8217;s probe this wait event to understand why a session would wait for this wait event. Understanding the wait event Let&#8217;s say that a foreground process running in node 1, is trying to access a block [...]]]></description>
			<content:encoded><![CDATA[<p>
You might encounter the RAC wait event &#8216;gc cr disk read&#8217; in 11.2 while tuning your applications in a RAC environment. Let&#8217;s probe this wait event to understand why a session would wait on it.
</p>
<p><strong>Understanding the wait event</strong></p>
<p>
Let&#8217;s say that a foreground process running in node 1 is trying to access a block using a SELECT statement, and that block is not in the local cache. To maintain read consistency, the foreground process requires a version of the block consistent with the query SCN. The sequence of operations is then (simplified):
</p>
<ol>
<li>The foreground session calculates the master node of the block and requests the LMS process running in the master node to access the block.</li>
<li>Let&#8217;s assume that the block is resident in the master node&#8217;s buffer cache. If the block is in a consistent state (meaning the block version SCN is lower than (or equal to?) the query SCN), then the LMS process can send the block to the foreground process immediately. Life is not that simple, so let&#8217;s assume that the requested block has an uncommitted transaction.</li>
<li>Since the block has uncommitted changes, the LMS process cannot send the block immediately. The LMS process must create a CR (Consistent Read) version of the block: it clones the buffer and applies undo records to the cloned buffer, rolling the block back to the SCN consistent with the requested query SCN.</li>
<li>The CR block is then sent to the foreground process.</li>
</ol>
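<p>The clone-and-roll-back step can be pictured with a toy model. The structures here are hypothetical (a dict of rows and a list of undo entries), not Oracle's block or undo format: the point is only that undo is applied to a clone, and only for changes the query must not see.</p>

```python
def build_cr_copy(current_rows, undo_chain, query_scn):
    """Toy consistent-read fabrication on a cloned block image.

    current_rows: dict row_id -> value, the current version of the block.
    undo_chain:   list of (row_id, old_value, commit_scn), newest change
                  first; commit_scn is None while the transaction is pending.
    Changes that are uncommitted, or committed after query_scn, are not
    visible to the query, so their undo is applied to the clone.
    """
    clone = dict(current_rows)  # the current buffer is left untouched
    for row_id, old_value, commit_scn in undo_chain:
        if commit_scn is None or commit_scn > query_scn:
            clone[row_id] = old_value  # roll this change back on the clone
    return clone


current = {100: "updated"}
cr = build_cr_copy(current, [(100, "original", None)], query_scn=5000)
print(cr, current)  # {100: 'original'} {100: 'updated'}
```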
<p><strong>LMS is a light weight process</strong></p>
<p>
Global cache operations must complete quickly, on the order of milliseconds, to maintain the overall performance of a RAC database. LMS is a critical process and does not do heavy-lifting tasks such as disk I/O. If the LMS process would have to initiate I/O, it instead downgrades the block mode and sends the block to the requesting foreground process (this is known as the Light Works rule). The foreground process then applies undo records to the block to construct the CR version of the block.
</p>
<p>
Now, the foreground process might not find the undo blocks in the local cache, as the transactions happened in the remote instance. A request is sent to the remote LMS process to access the undo block. If the undo block is not in the remote cache either, the remote LMS process sends a grant to the foreground process to read the undo block from disk. The foreground process accounts the time spent waiting for these undo segment block grants to the &#8216;gc cr disk read&#8217; wait event.
</p>
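<p>The two paragraphs above describe a small decision tree. The sketch below encodes it as a toy model. The function names and return strings are hypothetical, and real LMS logic has more cases (3-way transfers, fairness down converts and so on).</p>

```python
def lms_response(block_cached, block_consistent, undo_cached_on_lms_node):
    """What the requesting foreground gets back from the master's LMS (toy model)."""
    if not block_cached:
        return "grant: read the block from disk"
    if block_consistent:
        return "block: sent as is"
    if undo_cached_on_lms_node:
        return "cr block: LMS cloned the buffer and applied undo"
    # Light Works rule: LMS will not initiate disk I/O, so it downgrades
    # the block and ships it; the foreground must build the CR copy.
    return "block: downgraded, foreground builds CR copy"


def foreground_undo_fetch(undo_in_local_cache, undo_in_remote_cache):
    """How the foreground obtains an undo block it needs for CR fabrication."""
    if undo_in_local_cache:
        return "local buffer"
    if undo_in_remote_cache:
        return "shipped by remote LMS"
    # Remote LMS sends a grant; the time waiting for it is 'gc cr disk read',
    # followed by a 'db file sequential read' for the undo block itself.
    return "gc cr disk read"


print(lms_response(True, False, False))
print(foreground_undo_fetch(False, False))
```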
<p>
There are other reasons why the FG process might have to read an undo block. One of them is a fairness down convert triggered by the LMS process. Essentially, if a block is requested so many times that it leads to many CR block fabrications, then, instead of doing more work, the LMS process simply down converts the block, sends the block, and grants the requester access to it. The FG process then applies undo to construct the CR block itself.
</p>
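<p>A toy counter makes the fairness idea concrete. This is purely illustrative: the threshold value, the per-block counting and the reset after a down convert are assumptions of this sketch, not documented Oracle behavior.</p>

```python
class ToyFairness:
    """Hypothetical fairness down-convert counter (illustrative only).

    LMS counts the CR copies it fabricates per block; once the count reaches
    threshold, it down converts instead and the requester builds the CR copy.
    Resetting the count afterwards is an assumption of this sketch.
    """

    def __init__(self, threshold):
        self.threshold = threshold
        self.cr_builds = {}  # block id -> CR copies fabricated by LMS

    def serve(self, block_id):
        count = self.cr_builds.get(block_id, 0)
        if count >= self.threshold:
            self.cr_builds[block_id] = 0
            return "down convert"
        self.cr_builds[block_id] = count + 1
        return "LMS builds CR"


lms = ToyFairness(threshold=2)
print([lms.serve("file 7 block 2758") for _ in range(4)])
```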
<p>
gv$cr_block_server can be used to review the number of down converts, Light Works, etc. However, it is probably not possible to identify the reason for a particular block down convert after the event.
</p>
<p><strong>Why do we need this new event?</strong></p>
<p>
There is a very good reason why this event was introduced. Prior to 11g, waits for single block CR grants were accounted to wait events such as &#8216;gc cr block 2-way&#8217;, &#8216;gc cr block 3-way&#8217;, etc. Waits for grants on remote-instance undo blocks for CR fabrication are special, in the sense that this is additional work that is unnecessary from the application&#8217;s point of view. We need to be able to differentiate the time spent waiting for undo block grants for CR fabrication from other types of grants (such as data blocks). So, it looks like Oracle introduced this new event, and I do think it will be very useful for debugging performance issues.
</p>
<p>
Prior to 11g, you could still differentiate waits for single block grants for undo using ASH data or trace files. But you have to use the obj# field for this differentiation: obj# is set to 0 or -1 in the case of undo blocks and undo header blocks.
</p>
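<p>The obj# rule above can be applied mechanically to exported wait data. A minimal sketch, assuming rows of (event, wait time in microseconds, obj#) taken from ASH (e.g. event, time_waited, current_obj#) or from trace WAIT lines; the function name and the sample rows are hypothetical.</p>

```python
UNDO_OBJ_IDS = {0, -1}  # obj# reported for undo and undo header blocks


def split_gc_cr_waits(waits):
    """Split 'gc cr ...' wait time into undo vs other buckets using obj# only."""
    totals = {"undo": 0, "other": 0}
    for event, ela_us, obj_id in waits:
        if not event.startswith("gc cr"):
            continue  # only global cache CR waits are of interest here
        bucket = "undo" if obj_id in UNDO_OBJ_IDS else "other"
        totals[bucket] += ela_us
    return totals


sample = [
    ("gc cr block 2-way", 500, 0),        # undo block
    ("gc cr block 2-way", 300, 75154),    # table block
    ("db file sequential read", 700, 0),  # not a gc wait: ignored
]
print(split_gc_cr_waits(sample))  # {'undo': 500, 'other': 300}
```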
<p><strong>Test case</strong></p>
<p>
Of course, a test case would be nice. Any regular table will do; my table has just two columns, number and varchar2(255), with 1000 rows.
</p>
<div><code> create table rs.t_1000 (n1 number, v1 varchar2(255));<br />
insert into rs.t_1000 select n1, lpad(n1, 255, 'DEADBEAF') from (select level n1 from dual connect by level &lt;= 1000);<br />
commit;<br />
create index rs.t_1000_n1 on rs.t_1000 (n1);<br />
</code>
<ol>
<li>node 1: update rs.t_1000 set v1=v1 where n1=100</li>
<li>node 2: select * from rs.t_1000 where n1=100 &#8211; just to get parsing out of the way.</li>
<li>node 1: alter system flush buffer_cache; &#8211; flushes the buffer cache.</li>
<li>node 2: select * from rs.t_1000 where n1=100 &#8212; This SELECT statement suffers from gc cr disk read.</li>
</ol>
<p>
At step 3 in our test case, I flushed the buffer cache in node 1. When I read the block again in node 2, here is the approximate sequence of operations that occurred:
</p>
<ol>
<li>For the SELECT statement in step 4, the foreground sent a block request to the LMS process in node 2; the LMS process in node 2 did not find the block in the buffer cache (since we flushed the buffer cache).</li>
<li>So, the LMS process in node 2 sent a grant to the foreground process to read the data block from disk.</li>
<li>The foreground process read the block from disk and found that the block version is higher than the query environment SCN and that there is a pending transaction in an ITL entry of the block.</li>
<li>The foreground process clones the buffer and tries to apply undo records to reconstruct a CR version of the block matching the query environment SCN.</li>
<li>But that pending transaction was initiated in node 1, and its undo segment is mastered by node 1. So, the FG process sends a request for the undo block to the node 1 LMS process. The LMS process does not find the undo segment block in the node 1 cache and sends back to the FG process a grant to read the block from disk. Meanwhile, the FG process is waiting, and the time it spends waiting to receive the grant is accounted towards &#8216;gc cr disk read&#8217;.</li>
</ol>
<p>
(I formatted the trace lines to improve readability; my comments are inline.)
</p>
</div>
<p><code> PARSING IN CURSOR #18446741324873694136 len=162 dep=0 uid=0 oct=3 lid=0 tim=1103269602 hv=361365006 ad='73fba918' sqlid='7cmy6msasmzhf'<br />
select dbms_rowid.rowid_relative_fno (rowid) fno,<br />
dbms_rowid.rowid_block_number(rowid) block,<br />
dbms_rowid.rowid_object(rowid) obj, v1 from rs.t_1000 where n1=100<br />
END OF STMT<br />
c=0,e=17012,p=0,cr=0,cu=0,mis=1,r=0,dep=0,og=1,plh=479790187,tim=1103269601<br />
c=0,e=12393,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=479790187,tim=1103282129<br />
nam='SQL*Net message to client' ela= 4 driver id=1650815232 #bytes=1 p3=0 obj#=603 tim=1103288000<br />
nam='Disk file operations I/O' ela= 7 FileOperation=2 fileno=4 filetype=2 obj#=603 tim=1103288120<br />
nam='db file sequential read' ela= 786 file#=4 block#=2891 blocks=1 obj#=85844 tim=1103289701<br />
nam='db file sequential read' ela= 560 file#=4 block#=2892 blocks=1 obj#=85844 tim=1103290450<br />
nam='Disk file operations I/O' ela= 4 FileOperation=2 fileno=7 filetype=2 obj#=85844 tim=1103290546<br />
nam='db file sequential read' ela= 542 file#=7 block#=2758 blocks=1 obj#=75154 tim=1103291149<br />
-- RS: Following is for <b>undo header block</b> to find the transaction.<br />
nam='gc cr disk read' ela= 633 p1=6 p2=176 p3=43 obj#=0 tim=1103292048<br />
nam='db file sequential read' ela= 662 file#=6 block#=176 blocks=1 obj#=0 tim=1103292843<br />
-- RS: Following read is for <b>undo block </b>itself to rollback the transaction changes.<br />
nam='gc cr disk read' ela= 483 p1=6 p2=955 p3=44 obj#=0 tim=1103293699<br />
nam='db file sequential read' ela= 569 file#=6 block#=955 blocks=1 obj#=0 tim=1103294355<br />
nam='library cache pin' ela= 1045 handle address=2043988208 pin address=1969440144 100*mode+namespace=48820893384706 obj#=0 tim=1103295827<br />
FETCH :c=0,e=8033,p=5,cr=6,cu=0,mis=0,r=1,dep=0,og=1,plh=479790187,tim=1103296065<br />
WAIT : nam='SQL*Net message from client' ela= 323 driver id=1650815232 #bytes=1 p3=0 obj#=0 tim=1103296522<br />
FETCH :c=0,e=24,p=0,cr=1,cu=0,mis=0,r=0,dep=0,og=1,plh=479790187,tim=1103296586<br />
STAT id=1 cnt=1 pid=0 pos=1 obj=75154 op='TABLE ACCESS BY INDEX ROWID T_1000 (cr=7 pr=5 pw=0 time=6505 us cost=2 size=261 card=1)'<br />
STAT id=2 cnt=1 pid=1 pos=1 obj=85844 op='INDEX RANGE SCAN T_1000_N1 (cr=3 pr=2 pw=0 time=2494 us cost=1 size=0 card=1)'<br />
WAIT : nam='SQL*Net message to client' ela= 1 driver id=1650815232 #bytes=1 p3=0 obj#=0 tim=1103296728<br />
</code></p>
<div><strong>Trace file analysis</strong></div>
<p>Let&#8217;s review a few lines from the trace file. Block 2758 physically holds the row. After that block is read, a &#8216;gc cr disk read&#8217; wait event is encountered. Essentially, the FG identified that there is a pending transaction, so it sent a request to LMS, accounting the wait time to the &#8216;gc cr disk read&#8217; event. For this &#8216;gc cr disk read&#8217; event, p1 is the file_id, p2 is the block_id, and p3 seems to be a counter that increases by 1 for each of these waits; for example, for the next gc cr disk read, p3 is set to 44. obj#=0 indicates that this is an undo block or undo header block.
</p>
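<p>Pulling these waits out of a large trace file by eye is tedious. A minimal sketch that extracts the &#8216;gc cr disk read&#8217; lines and their p1/p2/p3/obj# values, assuming WAIT lines shaped like the ones above; the regular expressions and function name are hypothetical and untested against other trace formats.</p>

```python
import re

# WAIT line shape as printed in the trace above (the "WAIT :" prefix is optional).
WAIT_RE = re.compile(r"nam='([^']+)' ela= (\d+) (.*?) tim=(\d+)")
PARAM_RE = re.compile(r"([\w#]+)=(-?\d+)")


def gc_cr_disk_reads(trace_lines):
    """Extract 'gc cr disk read' waits: p1=file, p2=block, p3=counter, obj#."""
    found = []
    for line in trace_lines:
        m = WAIT_RE.search(line)
        if m is None or m.group(1) != "gc cr disk read":
            continue
        params = dict(PARAM_RE.findall(m.group(3)))
        found.append({
            "file": int(params["p1"]),
            "block": int(params["p2"]),
            "counter": int(params["p3"]),
            "undo": params.get("obj#") in ("0", "-1"),
            "ela_us": int(m.group(2)),
        })
    return found


lines = [
    "nam='db file sequential read' ela= 542 file#=7 block#=2758 blocks=1 obj#=75154 tim=1103291149",
    "nam='gc cr disk read' ela= 633 p1=6 p2=176 p3=43 obj#=0 tim=1103292048",
    "nam='gc cr disk read' ela= 483 p1=6 p2=955 p3=44 obj#=0 tim=1103293699",
]
for wait in gc_cr_disk_reads(lines):
    print(wait)
```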
<div>Notice that the next line shows that a physical read of that undo header block occurred, with obj# set to 0. I verified that it is the undo header block by dumping the block too.</div>
<p><code><br />
nam='db file sequential read' ela= 542 file#=7 block#=2758 blocks=1 obj#=75154 tim=1103291149<br />
nam='gc cr disk read' ela= 633 p1=6 p2=176 p3=43 obj#=0 tim=1103292048<br />
nam='db file sequential read' ela= 662 file#=6 block#=176 blocks=1 obj#=0 tim=1103292843</code></p>
<div>
<p><strong>Commit Cleanout</strong></p>
<p>
Commit cleanouts are another reason why you might encounter this wait event. When a session commits, it revisits the modified blocks to clean out the ITL entries in them. But this cleanout does not happen in all scenarios. For example, if the number of changed blocks exceeds the list of blocks that the session maintains in the SGA, then the session marks the transaction table as committed without cleaning out the ITL entries in the actual blocks. Commit cleanout also doesn&#8217;t happen if the modified block is no longer in the buffer cache (we will use this idea to force the cleanout to happen in the other node).
</p>
<p>
The next session reading that block will clean out the ITL entry in the block with an upper-bound SCN. Let&#8217;s tweak our test case a little to simulate commit cleanouts.
</p>
<p>
One notable difference in the test case below is that, after flushing the buffer cache, we switch back to the session that updated the block, commit the transaction, and flush the buffer cache again. With these two flushes, we guarantee that the session will not perform a commit cleanout and that the undo block is flushed from the cache. The SELECT statement is executed after the commit. When we read the same block from a different node, the reader session cleans out the ITL entry. The cleanout operation needs to identify the transaction state and the max query SCN. To identify whether the transaction is committed or not, the session in node 2 must read the undo header block. Waits to receive the grants for the undo or undo header block are accounted towards the gc cr disk read wait event.
</p>
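<p>The reader-side cleanout can be pictured with a toy model. The structures are hypothetical (ITL entries as dicts, the transaction table as a dict keyed by xid); the transaction-table lookup stands in for the undo header read that shows up as &#8216;gc cr disk read&#8217; in the trace below. This is not Oracle's block format or cleanout code.</p>

```python
def delayed_cleanout(itl_entries, transaction_table):
    """Toy ITL cleanout done by a later reader (hypothetical structures).

    itl_entries: list of dicts with 'xid' and 'commit_scn' (None = not yet
    cleaned out). transaction_table: dict xid -> commit SCN, or None while
    the transaction is active; consulting it stands in for reading the
    undo header block. Returns the number of ITL entries cleaned out.
    """
    cleaned = 0
    for entry in itl_entries:
        if entry["commit_scn"] is not None:
            continue  # already cleaned out
        commit_scn = transaction_table.get(entry["xid"])
        if commit_scn is None:
            continue  # still active (or unknown): leave the entry alone
        # Real Oracle may stamp an upper-bound SCN here instead of the exact one.
        entry["commit_scn"] = commit_scn
        cleaned += 1
    return cleaned


itl = [{"xid": "tx1", "commit_scn": None}]
print(delayed_cleanout(itl, {"tx1": 26028500}))  # 1
```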
<ol>
<li>node 1: update rs.t_1000 set v1=v1 where n1=100</li>
<li>node 2: select * from rs.t_1000 where n1=100 &#8211; just to get parsing out of the way.</li>
<li>node 1: alter system flush buffer_cache; &#8211; flushes the buffer cache.</li>
<li>node 1: commit, from the session that ran the update.</li>
<li>node 1: alter system flush buffer_cache; &#8211; flushes the buffer cache again to remove the undo blocks from the cache.</li>
<li>node 2: select * from rs.t_1000 where n1=100</li>
</ol>
<p><strong>Trace file </strong></p>
<p> My comments are inline </p>
<p><code><br />
...<br />
-- block with the row is read below<br />
WAIT : nam='db file sequential read' ela= 417 file#=7 block#=2758 blocks=1 obj#=75154 tim=783492616<br />
WAIT : nam='Disk file operations I/O' ela= 2 FileOperation=2 fileno=6 filetype=2 obj#=75154 tim=783492751<br />
-- Undo header block is read with time accounted towards gc cr disk read. Note that, there are no other reads after this.<br />
-- In the prior test case that we saw earlier, there was an additional read to read undo block;<br />
--  In this test case, no additional read for undo block as only commits need to be cleaned out.<br />
WAIT : nam='gc cr disk read' ela= 12779 p1=6 p2=160 p3=41 obj#=0 tim=783507346<br />
WAIT : nam='db file sequential read' ela= 778 file#=6 block#=160 blocks=1 obj#=0 tim=783508447<br />
</code><br />
<strong>So, what can we do?</strong></div>
<div>Is there anything you can do to improve the performance of the application and reduce &#8216;gc cr disk read&#8217; wait events? There are a few options you can consider, though some of them are long-term options.</div>
<div>
<ol>
<li>Consider application affinity. If the application runs on the same node where transactions are aggressively modifying the objects, you can reduce or eliminate these grants. Application affinity is important even though cache fusion is in play.</li>
<li>Of course, tune the statement so that the number of block visits is reduced.</li>
<li>Reduce commit cleanouts. For example, if a program heavily modifies blocks in node 1 and a subsequent program reads the same blocks in node 2, then you may have to clean out the blocks manually from node 1 by reading them there. BTW, parallel queries do not perform commit cleanouts; only serial queries do. Another operation that can get stuck is index creation. If the table has numerous blocks without commit cleanouts, then the index build might be slower and can suffer from gc cr disk read events. This is magnified by the fact that these reads are single-block reads, which are much slower for bulk operations such as an index rebuild. Reading through all blocks of the table with SELECT statements might be good enough to complete the commit cleanouts.</li>
<li>Don&#8217;t keep long-pending transactions on highly active tables in a RAC environment. This typically happens with applications that use tables as queues to pick up jobs, e.g. the fnd_concurrent_requests table in E-Business Suite. Proper configuration of the concurrent managers, with optimal values for sleep and cache, might help here.
          </li>
<li>For a read-only tablespace, make sure that there are no blocks left needing commit cleanout, since those blocks will never be cleaned out and so every session will try to clean them out.
        </li>
</ol>
<p><strong>Summary</strong></p>
<p>In summary, this is a useful event for differentiating CR fabrication performance issues from other performance issues. Using the techniques mentioned here, you can reduce the impact of this event.</p>
</div>
]]></content:encoded>
			<wfw:commentRss>http://www.orainternals.com/2012/01/13/gc-cr-disk-read/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RMOUG 2012 &#8211; Hello Denver!</title>
		<link>http://www.orainternals.com/2012/01/11/rmoug-2012-hello-denver/</link>
		<comments>http://www.orainternals.com/2012/01/11/rmoug-2012-hello-denver/#comments</comments>
		<pubDate>Tue, 10 Jan 2012 22:03:01 +0000</pubDate>
		<dc:creator>admin</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Performance tuning]]></category>
		<category><![CDATA[Presentations]]></category>
		<category><![CDATA[RAC]]></category>
		<category><![CDATA[oracle]]></category>
		<category><![CDATA[oracle performance]]></category>
		<category><![CDATA[Parallelism RAC]]></category>
		<category><![CDATA[RAC performance]]></category>
		<category><![CDATA[RAC protocols]]></category>

		<guid isPermaLink="false">http://www.orainternals.com/?p=1218</guid>
		<description><![CDATA[On February 14-16, I’ll be at the Colorado Convention Center in Denver, Colorado for RMOUG’s Training Days Conference. This is the largest regional Oracle User Conference in North America and attracts presenters from all around the country and the globe. I’ll be presenting: Presentation Name: Troubleshooting RAC Background Process  Abstract: RAC background process performance is critical to keep the application performance. This session will [...]]]></description>
			<content:encoded><![CDATA[<p><em><span style="font-family: Arial; font-size: x-small;">On February 14-16, I’ll be at the Colorado Convention Center in Denver, Colorado for </span></em><span style="font-family: Arial; font-size: x-small;"><a href="http://www.teamycc.com/RMOUG_2012_Conference/index.html" target="_blank"><em>RMOUG’s Training Days Conference</em></a><em>. This is the largest regional Oracle User Conference in North America and attracts presenters from all around the country and the globe. I’ll be presenting:</em></span></p>
<p><strong><em><span style="font-family: Arial; font-size: x-small;">Presentation Name: Troubleshooting RAC Background Process</span></em></strong></p>
<p><em><span style="font-family: Arial; font-size: x-small;"> </span></em><em><span style="font-family: Arial; font-size: x-small;">Abstract: RAC background process performance is critical to application performance. This session will demo techniques to review the performance of RAC background processes such as LMS, LMD, and LMON using various statistics and UNIX tools. The presentation will also discuss why certain background processes must run at a higher priority to maintain application performance in RAC.</span></em></p>
<p><strong><em><span style="font-family: Arial; font-size: x-small;">Presentation Name: A Kind and Gentle Introduction to RAC</span></em></strong></p>
<p><strong><em><span style="font-family: Arial; font-size: x-small;"> </span></em></strong><em><span style="font-family: Arial; font-size: x-small;">Abstract: This session will introduce basic concepts such as cache fusion, conversion to RAC, protocols for interconnect, general architectural overview, GES layer locks, clusterware, etc. The session will also discuss the srvctl command and demo a few of these commands to improve the understanding.</span></em></p>
<p><strong><em><span style="font-family: Arial; font-size: x-small;">Presentation Name: Parallel Execution in RAC</span></em></strong></p>
<p><strong><em><span style="font-family: Arial; font-size: x-small;"> </span></em></strong><em><span style="font-family: Arial; font-size: x-small;">Abstract: This presentation will discuss and demo parallel server allocation and intra- and inter-node parallelism. The session will discuss new parallelism features such as parallel statement queuing and auto DOP, and the interaction of those features with RAC. The session will also probe a few critical parameters to improve PQ performance in RAC.</span></em></p>
<p><span style="font-family: Arial; font-size: x-small;"><a href="http://www.teamycc.com/RMOUG_2012_Conference/index.html" target="_blank"><em>Click here</em></a><em> for more information or to register for RMOUG’s Training Days.</em></span></p>
]]></content:encoded>
			<wfw:commentRss>http://www.orainternals.com/2012/01/11/rmoug-2012-hello-denver/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

