0.9.8.10
Hypertable::AccessGroupGarbageTracker Class Reference

Tracks access group garbage and signals when collection is needed.

#include <AccessGroupGarbageTracker.h>

Public Member Functions

 AccessGroupGarbageTracker (PropertiesPtr &props, CellCacheManagerPtr &cell_cache_manager, AccessGroupSpec *ag_spec)
 Constructor.
 
void update_schema (AccessGroupSpec *ag_spec)
 Updates control variables from access group schema definition.
 
bool check_needed (time_t now)
 Signals if garbage collection is likely needed.
 
bool collection_needed (double total, double garbage)
 Determines if garbage collection is actually needed.
 
void adjust_targets (time_t now, double total, double garbage)
 Adjusts targets based on measured garbage.
 
void adjust_targets (time_t now, MergeScannerAccessGroup *mscanner)
 Adjusts targets using statistics from a merge scanner used in a GC compaction.
 
void update_cellstore_info (std::vector< CellStoreInfo > &stores, time_t t=0, bool collection_performed=true)
 Updates stored data statistics from current set of CellStores.
 
void output_state (std::ofstream &out, const std::string &label)
 Prints a human-readable representation of internal state to an output stream.
 

Private Member Functions

int64_t memory_accumulated_since_collection ()
 Computes the amount of in-memory data accumulated since last collection.
 
int64_t total_accumulated_since_collection ()
 Computes the total amount of data accumulated since last collection.
 
int64_t compute_delete_count ()
 Computes number of delete records in access group.
 
bool check_needed_deletes ()
 Signals if GC is likely needed due to MAX_VERSIONS or deletes.
 
bool check_needed_ttl (time_t now)
 Signals if GC is likely needed due to TTL.
 

Private Attributes

std::mutex m_mutex
 Mutex to serialize access to data members
 
CellCacheManagerPtr m_cell_cache_manager
 Cell cache manager
 
double m_garbage_threshold
 Fraction of accumulated garbage that triggers collection.
 
time_t m_elapsed_target {}
 Elapsed seconds required before signaling TTL GC likely needed (adaptive)
 
time_t m_elapsed_target_minimum {}
 Minimum elapsed seconds required before signaling TTL GC likely needed.
 
time_t m_last_collection_time {0}
 Time of last garbage collection
 
uint32_t m_stored_deletes {}
 Number of delete records accumulated in cell stores.
 
int64_t m_stored_expirable {}
 Amount of data accumulated in cell stores that could expire due to TTL.
 
int64_t m_last_collection_disk_usage {}
 Disk usage at the time the last garbage collection was performed.
 
int64_t m_current_disk_usage {}
 Current disk usage, updated by update_cellstore_info()
 
int64_t m_accum_data_target {}
 Amount of data to accumulate before signaling GC likely needed (adaptive)
 
int64_t m_accum_data_target_minimum {}
 Minimum amount of data to accumulate before signaling GC likely needed.
 
time_t m_min_ttl {}
 Minimum TTL found in access group schema.
 
bool m_have_max_versions {}
 true if any column families have non-zero MAX_VERSIONS
 
bool m_in_memory {}
 true if access group is in memory
 

Detailed Description

Tracks access group garbage and signals when collection is needed.

This class is used to heuristically estimate how much garbage has accumulated in the access group and will signal when collection is needed. The Hypertable.RangeServer.AccessGroup.GarbageThreshold.Percentage property defines the percentage of accumulated garbage in the access group that should trigger garbage collection. The algorithm will signal that garbage collection is needed under the following circumstances:

  1. If any of the column families in the access group has a non-zero MAX_VERSIONS or any delete records exist, and enough data (heuristically determined) has accumulated since the last collection.
  2. If any of the column families in the access group has a non-zero TTL, and the amount of the expirable data from the cell stores plus the in-memory data (cell cache) accumulated since the last collection represents a percentage of the overall access group size that is greater than or equal to the garbage threshold, and enough time (heuristically determined) has elapsed since the last collection.

The following code illustrates how to use this class. Periodically, the member function check_needed() should be called to check whether or not garbage collection may be needed, for example:

if (garbage_tracker.check_needed(now))
  schedule_compaction();

Then in the compaction routine, the actual garbage should be measured before proceeding with the compaction, for example:

if (garbage_tracker.check_needed(now)) {
  measure_garbage(&total, &garbage);
  garbage_tracker.adjust_targets(now, total, garbage);
  if (!garbage_tracker.collection_needed(total, garbage))
    abort_compaction();
}

The next step of the compaction routine is to perform the compaction:

MergeScannerAccessGroup *mscanner = new MergeScannerAccessGroup ...
while (mscanner->get(key, value)) {
  ...
  mscanner->forward();
}

At this point, the merge scanner should be passed into adjust_targets() to adjust the targets based on the statistics collected during the merge:

garbage_tracker.adjust_targets(now, mscanner);

Finally, after the call to adjust_targets(), it is safe for the compaction routine to drop the immutable cache or, as is the case with in-memory compactions, merge it back into the regular cache. At the end of the compaction routine, once the set of cell stores has been updated, the update_cellstore_info() routine must be called to properly update the state of the garbage tracker. For example:

bool gc_compaction = (mscanner->get_flags() &
                      MergeScannerAccessGroup::RETURN_DELETES) == 0;
garbage_tracker.update_cellstore_info(stores, now, gc_compaction);
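
The steps above can be tied together in one place. The following is a minimal sketch of the call order only, not the actual compaction routine. RecordingTracker, gc_compaction_flow(), and adjust_targets_from_scan() are hypothetical stand-ins introduced for illustration; the real tracker takes a merge scanner and a vector of CellStoreInfo, which are replaced here by placeholders.

```cpp
#include <ctime>
#include <string>
#include <vector>

// Hypothetical stand-in for the tracker; it only records which calls were made.
struct RecordingTracker {
  std::vector<std::string> calls;
  double threshold;  // assumed garbage threshold, expressed as a fraction

  explicit RecordingTracker(double t) : threshold(t) {}

  bool check_needed(time_t) {
    calls.push_back("check_needed");
    return true;  // pretend the heuristic fired
  }
  void adjust_targets(time_t, double, double) {
    calls.push_back("adjust_targets(measured)");
  }
  bool collection_needed(double total, double garbage) {
    calls.push_back("collection_needed");
    return garbage / total >= threshold;
  }
  // Stands in for adjust_targets(now, mscanner).
  void adjust_targets_from_scan(time_t) {
    calls.push_back("adjust_targets(mscanner)");
  }
  void update_cellstore_info(time_t, bool) {
    calls.push_back("update_cellstore_info");
  }
};

// Returns false if the compaction was aborted after measuring garbage.
template <typename Tracker>
bool gc_compaction_flow(Tracker &tracker, time_t now, double total,
                        double garbage) {
  if (tracker.check_needed(now)) {
    tracker.adjust_targets(now, total, garbage);
    if (!tracker.collection_needed(total, garbage))
      return false;  // measured garbage is below the threshold
  }
  // ... the merge scan would run here ...
  tracker.adjust_targets_from_scan(now);
  // ... the set of cell stores would be replaced here ...
  tracker.update_cellstore_info(now, true);
  return true;
}
```

With a threshold of 0.5, a measurement of 10 garbage bytes out of 100 aborts after three tracker calls, while 60 out of 100 runs through all five.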

Definition at line 109 of file AccessGroupGarbageTracker.h.

Constructor & Destructor Documentation

AccessGroupGarbageTracker::AccessGroupGarbageTracker ( PropertiesPtr &  props,
CellCacheManagerPtr &  cell_cache_manager,
AccessGroupSpec *  ag_spec 
)

Constructor.

Initializes m_garbage_threshold to the Hypertable.RangeServer.AccessGroup.GarbageThreshold.Percentage property converted into a fraction. Initializes m_accum_data_target and m_accum_data_target_minimum to 10% and 5% of the Hypertable.RangeServer.Range.SplitSize property, respectively. Then calls update_schema().
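
As a rough illustration of the arithmetic described above, the sketch below derives the three initial values from the two property values. InitialTargets and initial_targets() are hypothetical names, and the assumption that the percentage property arrives as an integer is made here only for the example.

```cpp
#include <cstdint>

// Hypothetical container for the values the constructor is described as setting.
struct InitialTargets {
  double garbage_threshold;           // percentage converted into a fraction
  int64_t accum_data_target;          // 10% of the split size
  int64_t accum_data_target_minimum;  // 5% of the split size
};

// threshold_percentage: value of the GarbageThreshold.Percentage property
// split_size:           value of the Range.SplitSize property, in bytes
InitialTargets initial_targets(int threshold_percentage, int64_t split_size) {
  InitialTargets targets;
  targets.garbage_threshold = threshold_percentage / 100.0;
  targets.accum_data_target = split_size / 10;
  targets.accum_data_target_minimum = split_size / 20;
  return targets;
}
```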

Parameters
props               Configuration properties
cell_cache_manager  Cell cache manager
ag_spec             Access group specification

Definition at line 42 of file AccessGroupGarbageTracker.cc.

Member Function Documentation

void AccessGroupGarbageTracker::adjust_targets ( time_t  now,
double  total,
double  garbage 
)

Adjusts targets based on measured garbage.

This function checks to see if the heuristic guess as to whether garbage collection is needed, check_needed(), matches the actual need as computed by garbage / total >= m_garbage_threshold. If they match, then no adjustment is necessary and the function returns. Otherwise, it will adjust m_accum_data_target and/or m_elapsed_target, if necessary.

An adjustment of m_accum_data_target is needed if there exists a non-zero MAX_VERSIONS or a delete record exists (compute_delete_count() returns a non-zero value), and the garbage collection need as reported by check_needed_deletes() does not match the actual need. The m_accum_data_target value will be adjusted using the following computation:

(total_accumulated_since_collection() * m_garbage_threshold)
  / measured_garbage_ratio

If GC is not needed (but the check indicated that it was), then the value of the above computation is multiplied by 1.15 which avoids micro adjustments leading to a flurry of unnecessary garbage measurements as the amount of garbage gets close to the threshold. If the adjustment results in an increase, it is limited to double the current value and if the adjustment results in a decrease, it is lowered to no less than m_accum_data_target_minimum.

An adjustment of m_elapsed_target is needed if m_min_ttl is non-zero and the garbage collection need as reported by check_needed_ttl() does not match the actual need. The m_elapsed_target value will be adjusted using the following computation:

time_t elapsed_time = now - m_last_collection_time
(elapsed_time * m_garbage_threshold) / measured_garbage_ratio

If GC is not needed (but the check indicated that it was), then the value of the above computation is multiplied by 1.15 which avoids micro adjustments leading to a flurry of unnecessary garbage measurements as the amount of garbage gets close to the threshold. If the adjustment results in an increase, it is limited to double the current value and if the adjustment results in a decrease, it is lowered to no less than m_elapsed_target_minimum.
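
Both adjustments follow the same pattern: propose a new target, optionally inflate it by 1.15, then clamp it. The sketch below captures that pattern under stated assumptions and is not the actual implementation. adjusted_target() is a hypothetical helper, measured_ratio is assumed to be garbage / total, and the guard for a zero ratio and the truncation to an integer are choices made for this example only.

```cpp
#include <algorithm>
#include <cstdint>

// current:        current target (bytes for m_accum_data_target,
//                 seconds for m_elapsed_target)
// minimum:        corresponding *_minimum value
// accumulated:    data or seconds accumulated since the last collection
// threshold:      garbage threshold as a fraction
// measured_ratio: assumed to be garbage / total
// gc_needed:      true if the measurement says collection is needed
int64_t adjusted_target(int64_t current, int64_t minimum, int64_t accumulated,
                        double threshold, double measured_ratio,
                        bool gc_needed) {
  // Assumption: with no measured garbage, apply the largest allowed increase.
  if (measured_ratio <= 0.0)
    return current * 2;

  double proposed =
      (static_cast<double>(accumulated) * threshold) / measured_ratio;
  if (!gc_needed)
    proposed *= 1.15;  // damping factor described in the text

  if (proposed > static_cast<double>(current))
    proposed = std::min(proposed, 2.0 * static_cast<double>(current));
  else
    proposed = std::max(proposed, static_cast<double>(minimum));
  return static_cast<int64_t>(proposed);
}
```

For example, with a current target of 1000, 1000 units accumulated, a threshold of 0.25, and a measured ratio of 0.125 when GC was not needed, the proposal is 2000 * 1.15 = 2300, which is capped at double the current value, 2000.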

Parameters
now      Current time to be used in elapsed time calculation
total    Measured number of bytes in access group
garbage  Measured amount of garbage in access group

Definition at line 135 of file AccessGroupGarbageTracker.cc.

void AccessGroupGarbageTracker::adjust_targets ( time_t  now,
MergeScannerAccessGroup *  mscanner 
)

Adjusts targets using statistics from a merge scanner used in a GC compaction.

This member function first checks whether mscanner was used in a GC compaction by testing its flags for the absence of the MergeScannerAccessGroup::RETURN_DELETES flag. If so, it retrieves the I/O statistics from mscanner to determine the overall size and the amount of garbage removed during the merge scan, and then calls the three-argument adjust_targets() overload with those values.

Parameters
now       Current time to be used in elapsed time calculation
mscanner  Merge scanner used in a GC compaction

Definition at line 124 of file AccessGroupGarbageTracker.cc.

bool AccessGroupGarbageTracker::check_needed ( time_t  now)

Signals if garbage collection is likely needed.

Returns true if check_needed_deletes() or check_needed_ttl() returns true, false otherwise. This function will return false unconditionally until m_last_collection_time is initialized with a call to update_cellstore_info() which is the point at which the tracker state has been properly initialized.

Parameters
now  Current time
Returns
true if garbage collection is likely needed, false otherwise

Definition at line 115 of file AccessGroupGarbageTracker.cc.

bool AccessGroupGarbageTracker::check_needed_deletes ( )
private

Signals if GC is likely needed due to MAX_VERSIONS or deletes.

This method computes the amount of data that has accumulated since the last collection by adding the data accumulated on disk, m_current_disk_usage - m_last_collection_disk_usage, to the in-memory data accumulated, memory_accumulated_since_collection(). It then returns true if m_have_max_versions is true or compute_delete_count() returns a non-zero value, and the amount of data that has accumulated since the last collection is greater than or equal to m_accum_data_target.
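
The condition can be written out as a free function over plain values. This is a minimal sketch of the logic as described, with the member variables and helper results passed in as parameters; deletes_gc_likely_needed() is a hypothetical name.

```cpp
#include <cstdint>

// Mirrors the described check: garbage from MAX_VERSIONS or deletes is
// possible, and enough data has accumulated since the last collection.
bool deletes_gc_likely_needed(bool have_max_versions, int64_t delete_count,
                              int64_t current_disk_usage,
                              int64_t last_collection_disk_usage,
                              int64_t memory_accumulated,
                              int64_t accum_data_target) {
  int64_t accumulated =
      (current_disk_usage - last_collection_disk_usage) + memory_accumulated;
  bool garbage_possible = have_max_versions || delete_count > 0;
  return garbage_possible && accumulated >= accum_data_target;
}
```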

Returns
true if collection may be needed due to MAX_VERSIONS or delete records, false otherwise

Definition at line 216 of file AccessGroupGarbageTracker.cc.

bool AccessGroupGarbageTracker::check_needed_ttl ( time_t  now)
private

Signals if GC is likely needed due to TTL.

This member function will return true if m_min_ttl is non-zero, and the amount of the expirable data from the cell stores, m_stored_expirable, plus the in-memory data accumulated since the last collection, memory_accumulated_since_collection(), represents a percentage of the overall access group size that is greater than or equal to the garbage threshold (m_garbage_threshold), and the time that has elapsed since the last collection is greater than or equal to m_elapsed_target.
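
A sketch of the three-part condition follows. ttl_gc_likely_needed() is a hypothetical name, and total_size stands for "the overall access group size", which this section does not define precisely; how it is computed in the real class is not shown here, so it is passed in as an assumption.

```cpp
#include <cstdint>
#include <ctime>

bool ttl_gc_likely_needed(time_t now, time_t last_collection_time,
                          time_t min_ttl, time_t elapsed_target,
                          int64_t stored_expirable, int64_t memory_accumulated,
                          int64_t total_size, double garbage_threshold) {
  // No TTL configured (or nothing to measure against): never signal.
  if (min_ttl == 0 || total_size <= 0)
    return false;

  double possible_garbage =
      static_cast<double>(stored_expirable + memory_accumulated);
  bool enough_garbage =
      possible_garbage / static_cast<double>(total_size) >= garbage_threshold;
  bool enough_time = (now - last_collection_time) >= elapsed_target;
  return enough_garbage && enough_time;
}
```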

Parameters
now  Current time
Returns
true if collection may be needed due to TTL, false otherwise.

Definition at line 223 of file AccessGroupGarbageTracker.cc.

bool Hypertable::AccessGroupGarbageTracker::collection_needed ( double  total,
double  garbage 
)
inline

Determines if garbage collection is actually needed.

Measures the fraction of actual garbage, garbage / total, in the access group and compares it to m_garbage_threshold. If the measured garbage meets or exceeds the threshold, then true is returned.

Parameters
total    Measured number of bytes in access group
garbage  Measured amount of garbage in access group
Returns
true if garbage collection is needed, false otherwise.

Definition at line 156 of file AccessGroupGarbageTracker.h.

int64_t AccessGroupGarbageTracker::compute_delete_count ( )
private

Computes number of delete records in access group.

This method computes the number of delete records by adding m_stored_deletes to the deletes from the immutable cache, if one exists, or otherwise to all deletes reported by the cell cache manager.
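
The choice between the two in-memory sources can be sketched as follows. delete_count_sketch() is a hypothetical helper; a null pointer is used here to model "no immutable cache installed", which is an assumption of the example rather than how the cell cache manager exposes it.

```cpp
#include <cstdint>

// stored_deletes:          delete records already persisted in cell stores
// immutable_cache_deletes: deletes in the immutable cache, or nullptr if none
// manager_deletes:         all deletes reported by the cell cache manager
int64_t delete_count_sketch(uint32_t stored_deletes,
                            const int64_t *immutable_cache_deletes,
                            int64_t manager_deletes) {
  int64_t in_memory_deletes =
      immutable_cache_deletes ? *immutable_cache_deletes : manager_deletes;
  return static_cast<int64_t>(stored_deletes) + in_memory_deletes;
}
```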

Returns
Number of delete records in access group

Definition at line 207 of file AccessGroupGarbageTracker.cc.

int64_t AccessGroupGarbageTracker::memory_accumulated_since_collection ( )
private

Computes the amount of in-memory data accumulated since last collection.

If an immutable cache has been installed, then the accumulated memory is the logical size of the immutable cache, otherwise, it is the logical size returned by the cell cache manager. If the access group is in memory, then m_last_collection_disk_usage is subtracted since all of the access group data is held in memory and we only want what's accumulated since the last collection.
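
Written as a function over plain values, the rule looks like the sketch below. memory_accumulated_sketch() and its parameters are hypothetical; the sizes would come from the cell cache manager in the real class.

```cpp
#include <cstdint>

int64_t memory_accumulated_sketch(bool has_immutable_cache,
                                  int64_t immutable_logical_size,
                                  int64_t manager_logical_size, bool in_memory,
                                  int64_t last_collection_disk_usage) {
  int64_t accumulated =
      has_immutable_cache ? immutable_logical_size : manager_logical_size;
  // An in-memory access group holds all of its data in the cache, so remove
  // what was already present at the last collection.
  if (in_memory)
    accumulated -= last_collection_disk_usage;
  return accumulated;
}
```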

Returns
Amount of in-memory data accumulated since last collection

Definition at line 189 of file AccessGroupGarbageTracker.cc.

void AccessGroupGarbageTracker::output_state ( std::ofstream &  out,
const std::string &  label 
)

Prints a human-readable representation of internal state to an output stream.

This function prints a human readable representation of the tracker state to the output stream out. Each state variable is formatted as follows:

<label> '\t' <name> '\t' <value> '\n'

Parameters
out    Output stream on which to print state
label  String label to print at beginning of each line.
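
To make the line format concrete, here is a small sketch that emits one such line. print_state_line() and format_state_line() are hypothetical helpers, and the sketch writes to a generic std::ostream rather than the std::ofstream the member function takes.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Writes one state variable as: <label> TAB <name> TAB <value> NEWLINE
template <typename T>
void print_state_line(std::ostream &out, const std::string &label,
                      const std::string &name, const T &value) {
  out << label << '\t' << name << '\t' << value << '\n';
}

// Convenience wrapper that returns the formatted line as a string.
template <typename T>
std::string format_state_line(const std::string &label,
                              const std::string &name, const T &value) {
  std::ostringstream out;
  print_state_line(out, label, name, value);
  return out.str();
}
```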

Definition at line 94 of file AccessGroupGarbageTracker.cc.

int64_t AccessGroupGarbageTracker::total_accumulated_since_collection ( )
private

Computes the total amount of data accumulated since last collection.

This function computes the total amount of data accumulated since the last collection, including data that was persisted to disk due to minor compactions. It computes the total by adding the value returned by memory_accumulated_since_collection() to m_current_disk_usage - m_last_collection_disk_usage.

Returns
Total amount of data accumulated since last collection

Definition at line 200 of file AccessGroupGarbageTracker.cc.

void AccessGroupGarbageTracker::update_cellstore_info ( std::vector< CellStoreInfo > &  stores,
time_t  t = 0,
bool  collection_performed = true 
)

Updates stored data statistics from current set of CellStores.

This method updates the m_stored_expirable, m_stored_deletes, and m_current_disk_usage variables by summing the corresponding values from the cell stores in stores. The disk usage is computed as the uncompressed disk usage. If the access group is in memory, then the disk usage is taken to be the logical size as reported by the cell cache manager. If collection_performed is set to true, then m_last_collection_time is set to t and m_last_collection_disk_usage is set to the disk usage as computed in the previous step.
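
The update can be pictured as a reset-and-sum followed by two conditional overrides. The sketch below follows the description above; StoreStats, TrackerStats, and refresh_stats() are hypothetical, since the fields of the real CellStoreInfo are not shown in this section.

```cpp
#include <cstdint>
#include <ctime>
#include <vector>

// Hypothetical per-store statistics standing in for CellStoreInfo.
struct StoreStats {
  int64_t expirable_data;
  uint32_t delete_count;
  int64_t uncompressed_disk_usage;
};

// Hypothetical subset of the tracker's state.
struct TrackerStats {
  int64_t stored_expirable = 0;
  uint32_t stored_deletes = 0;
  int64_t current_disk_usage = 0;
  int64_t last_collection_disk_usage = 0;
  time_t last_collection_time = 0;
};

void refresh_stats(TrackerStats &stats, const std::vector<StoreStats> &stores,
                   bool in_memory, int64_t cache_logical_size, time_t t,
                   bool collection_performed) {
  stats.stored_expirable = 0;
  stats.stored_deletes = 0;
  stats.current_disk_usage = 0;
  for (const StoreStats &store : stores) {
    stats.stored_expirable += store.expirable_data;
    stats.stored_deletes += store.delete_count;
    stats.current_disk_usage += store.uncompressed_disk_usage;
  }
  // In-memory access groups use the cache's logical size as their disk usage.
  if (in_memory)
    stats.current_disk_usage = cache_logical_size;
  // Only a GC compaction resets the collection baseline.
  if (collection_performed) {
    stats.last_collection_time = t;
    stats.last_collection_disk_usage = stats.current_disk_usage;
  }
}
```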

Parameters
stores                Current set of CellStores
t                     Time to use to update m_last_collection_time
collection_performed  true if new cell stores are the result of a GC compaction

Definition at line 74 of file AccessGroupGarbageTracker.cc.

void AccessGroupGarbageTracker::update_schema ( AccessGroupSpec *  ag_spec)

Updates control variables from access group schema definition.

This method sets m_have_max_versions to true if any column family in the schema has a non-zero max_versions, sets m_min_ttl to the minimum of the TTL values found in the column families, and sets both m_elapsed_target_minimum and m_elapsed_target to 10% of that minimum TTL. This function should be called whenever the access group's schema changes.
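
A sketch of how these control variables might be derived from a list of column families is shown below. ColumnFamilyOptions, SchemaControls, and derive_schema_controls() are hypothetical, and treating a TTL of zero as "no TTL set" (so it is skipped when taking the minimum) is an assumption of this example.

```cpp
#include <cstdint>
#include <ctime>
#include <vector>

// Hypothetical view of the per-column-family options; 0 means "not set".
struct ColumnFamilyOptions {
  int32_t max_versions;
  time_t ttl;
};

// Hypothetical container for the control variables described above.
struct SchemaControls {
  bool have_max_versions = false;
  time_t min_ttl = 0;
  time_t elapsed_target_minimum = 0;
  time_t elapsed_target = 0;
};

SchemaControls derive_schema_controls(
    const std::vector<ColumnFamilyOptions> &families) {
  SchemaControls controls;
  for (const ColumnFamilyOptions &cf : families) {
    if (cf.max_versions != 0)
      controls.have_max_versions = true;
    // Assumption: only non-zero TTLs take part in the minimum.
    if (cf.ttl != 0 && (controls.min_ttl == 0 || cf.ttl < controls.min_ttl))
      controls.min_ttl = cf.ttl;
  }
  controls.elapsed_target_minimum = controls.min_ttl / 10;
  controls.elapsed_target = controls.elapsed_target_minimum;
  return controls;
}
```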

Parameters
ag_spec  Access group specification

Definition at line 55 of file AccessGroupGarbageTracker.cc.

Member Data Documentation

int64_t Hypertable::AccessGroupGarbageTracker::m_accum_data_target {}
private

Amount of data to accumulate before signaling GC likely needed (adaptive)

Definition at line 335 of file AccessGroupGarbageTracker.h.

int64_t Hypertable::AccessGroupGarbageTracker::m_accum_data_target_minimum {}
private

Minimum amount of data to accumulate before signaling GC likely needed.

Definition at line 338 of file AccessGroupGarbageTracker.h.

CellCacheManagerPtr Hypertable::AccessGroupGarbageTracker::m_cell_cache_manager
private

Cell cache manager

Definition at line 306 of file AccessGroupGarbageTracker.h.

int64_t Hypertable::AccessGroupGarbageTracker::m_current_disk_usage {}
private

Current disk usage, updated by update_cellstore_info()

Definition at line 331 of file AccessGroupGarbageTracker.h.

time_t Hypertable::AccessGroupGarbageTracker::m_elapsed_target {}
private

Elapsed seconds required before signaling TTL GC likely needed (adaptive)

Definition at line 313 of file AccessGroupGarbageTracker.h.

time_t Hypertable::AccessGroupGarbageTracker::m_elapsed_target_minimum {}
private

Minimum elapsed seconds required before signaling TTL GC likely needed.

Definition at line 316 of file AccessGroupGarbageTracker.h.

double Hypertable::AccessGroupGarbageTracker::m_garbage_threshold
private

Fraction of accumulated garbage that triggers collection.

Definition at line 309 of file AccessGroupGarbageTracker.h.

bool Hypertable::AccessGroupGarbageTracker::m_have_max_versions {}
private

true if any column families have non-zero MAX_VERSIONS

Definition at line 344 of file AccessGroupGarbageTracker.h.

bool Hypertable::AccessGroupGarbageTracker::m_in_memory {}
private

true if access group is in memory

Definition at line 347 of file AccessGroupGarbageTracker.h.

int64_t Hypertable::AccessGroupGarbageTracker::m_last_collection_disk_usage {}
private

Disk usage at the time the last garbage collection was performed.

Definition at line 328 of file AccessGroupGarbageTracker.h.

time_t Hypertable::AccessGroupGarbageTracker::m_last_collection_time {0}
private

Time of last garbage collection

Definition at line 319 of file AccessGroupGarbageTracker.h.

time_t Hypertable::AccessGroupGarbageTracker::m_min_ttl {}
private

Minimum TTL found in access group schema.

Definition at line 341 of file AccessGroupGarbageTracker.h.

std::mutex Hypertable::AccessGroupGarbageTracker::m_mutex
private

Mutex to serialize access to data members

Definition at line 303 of file AccessGroupGarbageTracker.h.

uint32_t Hypertable::AccessGroupGarbageTracker::m_stored_deletes {}
private

Number of delete records accumulated in cell stores.

Definition at line 322 of file AccessGroupGarbageTracker.h.

int64_t Hypertable::AccessGroupGarbageTracker::m_stored_expirable {}
private

Amount of data accumulated in cell stores that could expire due to TTL.

Definition at line 325 of file AccessGroupGarbageTracker.h.


The documentation for this class was generated from the following files: