Friday, April 10, 2009

interesting & useful: launchd on os x

http://developer.apple.com/MacOsX/launchd.html

--
Love,
Fritz

Using ActiveRecord :include to perform join filtering is expensive

For the User#first_degree_friend_ids lookup, the AR version looks like this:

self.connections.find(:all, :conditions => ["connections.type IN (?) AND users.active=1", Connection::FRIEND_CONNECTION_TYPES], :order => "users.id ASC", :include => :to).collect(&:to_id).uniq
(a work of art in a single line of Ruby code ;-))

My refactored version using raw SQL looks like this:
quoted_types = Connection::FRIEND_CONNECTION_TYPES.map{|type| "'#{type}'"}
sql = "SELECT DISTINCT(c.to_id) " +
"FROM connections c LEFT JOIN users u ON (c.to_id=u.id) " +
"WHERE c.from_id=#{self.id.to_i} AND c.type IN (#{quoted_types.join(',')}) AND u.active=1 " +
"AND c.to_id NOT IN (SELECT blockee_id FROM blockings WHERE blocker_id=#{self.id.to_i}) AND c.to_id!=#{self.id.to_i}"
Connection.connection.select_values(sql).map(&:to_i)

You'll notice that the SQL version also handles blockings natively now.

Now for the benchmarking using Max's user account with 406 friends:

For the ActiveRecord version:
>> Benchmark.bm(1) { |x| x.report{u.ar_first_degree_friend_ids} }
user system total real
0.400000 0.120000 0.520000 ( 0.615262)

For the SQL version:
>> Benchmark.bm(1) { |x| x.report{max.first_degree_friend_ids} }
user system total real
0.000000 0.000000 0.000000 ( 0.019606)

The performance difference is pretty obvious. This change is not only more correct, but also about 30x faster (0.62s vs. 0.02s for 406 friends). We should see a significant performance improvement in cache priming and lazy cache loading as a result.

Thursday, April 9, 2009

Wednesday, April 8, 2009

Log rotation best practices

This is definitely the best writeup about the problems with the conveyor-belt-style log rotation driven by logrotate and most default log rotation schemes, and it explains a lot of the frustrations with how our logging currently works:

http://sial.org/howto/logging/rotate/

Their suggested solution is pretty simple: log into time-based buckets in the first place, so you never need conveyor-belt rotation and can always find the right logs.

"given the choice of a logrotate scheme, I would first recommend time based buckets, second a startup time scheme, and never hindsight conveyer belt rotation."
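The time-based-buckets idea can be sketched in a few lines of Ruby (class and file names here are hypothetical, not from the article): append each entry to a date-stamped file, so nothing ever needs to be renamed and yesterday's log is always in yesterday's file.

```ruby
require "fileutils"

# Minimal sketch of time-bucketed logging: one file per day,
# named app.YYYY-MM-DD.log, appended to directly. No rotation,
# no renaming, no conveyor belt.
class BucketedLogger
  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(dir)
  end

  def log(message)
    bucket = Time.now.strftime("%Y-%m-%d") # one bucket per day
    File.open(File.join(@dir, "app.#{bucket}.log"), "a") do |f|
      f.puts "#{Time.now.strftime('%H:%M:%S')} #{message}"
    end
  end
end

logger = BucketedLogger.new("log")
logger.log("request served")
```

Finding "the right logs" for a given day is then just opening the file with that date in its name.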

Rackspace's approach to logs and analysis over them

Rackspace gradually improved their in-house logging through a few phases, eventually forwarding logs to a distributed filesystem with Hadoop running on top of it, so they could run a MapReduce analysis task against the logs to answer pretty much any question in a deterministic runtime.

See:

http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data

http://blog.racklabs.com/?p=66
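To make the MapReduce shape concrete, here is a toy in-memory sketch (not Rackspace's actual jobs; the log lines and field names are invented): map each log line to a key/value pair, then reduce by key.

```ruby
# Toy MapReduce over log lines: count bounce events per mail host.
lines = [
  "mail01 bounce user@example.com",
  "mail02 delivered user@example.com",
  "mail01 bounce other@example.com",
]

# Map phase: emit [host, 1] for every bounce line.
mapped = lines.map { |l| l.split(" ") }
              .select { |host, event, addr| event == "bounce" }
              .map { |host, event, addr| [host, 1] }

# Reduce phase: sum the counts per host.
counts = mapped.group_by(&:first)
               .map { |host, pairs| [host, pairs.map(&:last).sum] }
               .to_h

p counts # => {"mail01"=>2}
```

The real win in the Rackspace setup is that the map phase runs in parallel across the machines holding the log chunks, so the same query shape scales to terabytes.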

Tuesday, April 7, 2009

wrt magic labels in fixtures

IF you specify an id for a fixture THEN the magical labeling doesn't work:

# conversations.yml
woodchuck_capabilities:
  id: 1
...

# channels.yml
woodchuck_channel:
  conversation: :woodchuck_capabilities   # <---- NOPE
  conversation_id: 1                      # <---- YEP

I knew that the magic couldn't span databases, but didn't know
that specifying an id disables that behavior...

--
Love,
Fritz

Friday, April 3, 2009

useful: schema dump in code

For example:

ActiveRecord::SchemaDumper.dump(Operation.connection)

--
Love,
Fritz

Re: FYI: poor mysql query planning when ordering by id

I think these order by id clauses everywhere were frequently put in place to get consistent time-based ordering of entries in the test suite when several things may be created at nearly the same time. Sucks that mysql isn't optimizing them very well. Regardless, it probably wasn't a very 'correct' choice (e.g. if we ever move to a sharded data storage scheme). We should probably move towards a more 'correct' clause: "order by created_at, id", since it expresses what we actually want (created_at order) but falls back to id for consistent results in the test suite.
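A tiny illustration of why the id tiebreaker matters (Msg here is just a stand-in struct, not our real model): when two rows share a created_at, sorting by created_at alone leaves their relative order unspecified, while the id fallback makes it deterministic.

```ruby
# Stand-in for an ActiveRecord row with the two columns we sort on.
Msg = Struct.new(:id, :created_at)

rows = [
  Msg.new(3, 100), # same created_at as id 1
  Msg.new(1, 100),
  Msg.new(2, 99),
]

# "order by created_at, id": time order first, id breaks ties.
ordered = rows.sort_by { |m| [m.created_at, m.id] }

p ordered.map(&:id) # => [2, 1, 3]
```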

rp



On Fri, Apr 3, 2009 at 12:34 AM, Nathan Stoll <nathan@themechanicalzoo.com> wrote:
Geesh.  That's nice work mysql optimizer!

On Thu, Apr 2, 2009 at 7:48 PM, Fritz Schneider
<fritz@themechanicalzoo.com> wrote:
> Paying attention to slow queries today Bob, Nick, and I saw a number of
> seemingly innocuous-looking sql statements that were taking a ridiculously
> long time to execute. For example:
>
> SELECT * FROM `msgs` WHERE address_id = 11295 ORDER BY id DESC LIMIT 1;
>
> There's an index on address_id so there's no reason this should take up to 8
> seconds! Explaining it:
>
> mysql> explain SELECT * FROM `msgs` WHERE address_id = 11295 ORDER BY id DESC LIMIT 1;
> +----+-------------+-------+-------+--------------------------------------------------------------------------------------------------+---------+---------+------+-------+-------------+
> | id | select_type | table | type  | possible_keys                                                                                    | key     | key_len | ref  | rows  | Extra       |
> +----+-------------+-------+-------+--------------------------------------------------------------------------------------------------+---------+---------+------+-------+-------------+
> |  1 | SIMPLE      | msgs  | index | index_msgs_on_address_id_and_incoming,index_msgs_on_address_id_and_recognized_cmd_and_created_at | PRIMARY | 4       | NULL | 75525 | Using where |
> +----+-------------+-------+-------+--------------------------------------------------------------------------------------------------+---------+---------+------+-------+-------------+
> 1 row in set (0.00 sec)
>
> It's using the id as the key for this query because of the order by,
> resulting in a 75k row scan. Adding an additional order by (should be
> equivalent) solves the problem:
>
> mysql> explain SELECT * FROM `msgs` WHERE (`msgs`.address_id = 11295) ORDER BY id DESC, created_at DESC LIMIT 1;
> +----+-------------+-------+------+--------------------------------------------------------------------------------------------------+---------------------------------------+---------+-------+------+-----------------------------+
> | id | select_type | table | type | possible_keys                                                                                    | key                                   | key_len | ref   | rows | Extra                       |
> +----+-------------+-------+------+--------------------------------------------------------------------------------------------------+---------------------------------------+---------+-------+------+-----------------------------+
> |  1 | SIMPLE      | msgs  | ref  | index_msgs_on_address_id_and_incoming,index_msgs_on_address_id_and_recognized_cmd_and_created_at | index_msgs_on_address_id_and_incoming | 5       | const |    9 | Using where; Using filesort |
> +----+-------------+-------+------+--------------------------------------------------------------------------------------------------+---------------------------------------+---------+-------+------+-----------------------------+
>
> I'm fixing up the two specific places I noticed this (msgs,
> routing_suggestion_requests), but it's something to be on the lookout for
> because we order by id all over the place.
>
> I'll be paying much more attention to slow queries now that we have splunk
> in place.
>
> --
> Love,
> Fritz

Thursday, April 2, 2009

FYI: poor mysql query planning when ordering by id

Paying attention to slow queries today Bob, Nick, and I saw a number of seemingly innocuous-looking sql statements that were taking a ridiculously long time to execute. For example:

SELECT * FROM `msgs` WHERE address_id = 11295 ORDER BY id DESC LIMIT 1;

There's an index on address_id so there's no reason this should take up to 8 seconds! Explaining it:

mysql> explain SELECT * FROM `msgs` WHERE address_id = 11295 ORDER BY id DESC LIMIT 1;
+----+-------------+-------+-------+--------------------------------------------------------------------------------------------------+---------+---------+------+-------+-------------+
| id | select_type | table | type  | possible_keys                                                                                    | key     | key_len | ref  | rows  | Extra       |
+----+-------------+-------+-------+--------------------------------------------------------------------------------------------------+---------+---------+------+-------+-------------+
|  1 | SIMPLE      | msgs  | index | index_msgs_on_address_id_and_incoming,index_msgs_on_address_id_and_recognized_cmd_and_created_at | PRIMARY | 4       | NULL | 75525 | Using where |
+----+-------------+-------+-------+--------------------------------------------------------------------------------------------------+---------+---------+------+-------+-------------+
1 row in set (0.00 sec)

It's using the id as the key for this query because of the order by, resulting in a 75k row scan. Adding an additional order by (should be equivalent) solves the problem:

mysql> explain SELECT * FROM `msgs` WHERE (`msgs`.address_id = 11295) ORDER BY id DESC,
created_at DESC LIMIT 1;
+----+-------------+-------+------+--------------------------------------------------------------------------------------------------+---------------------------------------+---------+-------+------+-----------------------------+
| id | select_type | table | type | possible_keys                                                                                    | key                                   | key_len | ref   | rows | Extra                       |
+----+-------------+-------+------+--------------------------------------------------------------------------------------------------+---------------------------------------+---------+-------+------+-----------------------------+
|  1 | SIMPLE      | msgs  | ref  | index_msgs_on_address_id_and_incoming,index_msgs_on_address_id_and_recognized_cmd_and_created_at | index_msgs_on_address_id_and_incoming | 5       | const |    9 | Using where; Using filesort |
+----+-------------+-------+------+--------------------------------------------------------------------------------------------------+---------------------------------------+---------+-------+------+-----------------------------+

I'm fixing up the two specific places I noticed this (msgs, routing_suggestion_requests), but it's something to be on the lookout for because we order by id all over the place.

I'll be paying much more attention to slow queries now that we have splunk in place.

--
Love,
Fritz

"Migrating from svn to DVCS"

This is by far the best SVN-vs-X post I've seen for any given DVCS, and it does it for all of the significant ones. Really impressive, and very helpful/informative if we ever consider switching away from svn.

Migrating from svn to a distributed VCS

Wednesday, April 1, 2009

memoize & Aws::S3

FYI, the ActiveSupport memoize feature doesn't play well with Aws::S3: the S3 library adds Kernel#memoize, which conflicts with ActiveSupport::Memoizable#memoize. This is the second time I've tried to use ActiveSupport's memoize, hit problems, and ended up doing the memoization myself.
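For reference, the hand-rolled version is just an instance-variable cache (class and method names below are hypothetical, not from our codebase):

```ruby
# Hand-rolled memoization, the fallback when ActiveSupport's
# memoize clashes with Kernel#memoize from the S3 library.
class Account
  def expensive_lookup
    # ||= caches the first result; note it re-runs the computation
    # if the result is nil or false, so use a defined? guard for
    # methods that can legitimately return those.
    @expensive_lookup ||= compute
  end

  private

  def compute
    # stands in for the slow call (S3 fetch, big query, etc.)
    42
  end
end

a = Account.new
p a.expensive_lookup # => 42 (computed)
p a.expensive_lookup # => 42 (cached)
```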

--
Love,
Fritz