Log in

No account? Create an account

t3knomanser's Fustian Deposits

Mindbending Puzzle (with bonus mindbending physics)

How Random Babbling Becomes Corporate Policy

run the fuck away

Mad science gone horribly, horribly wrong(or right).

Mindbending Puzzle (with bonus mindbending physics)

Previous Entry Share Next Entry
run the fuck away
I have a puzzle at work. My puzzle works like so.

I have an application that lives in a SQL Server 2005 database. It builds dynamic SQL based on snippets of code stored in DB tables, and then executes that dynsql against two other databases to get a single number back for each query I run.

So, the general flow is that for location X, I lookup the bits of the first query, execute it against a remote database, and get a number back. Then I move onto the next query. Once I have executed all the remote queries, I update a DB table based on the results.

I have one entire class of locations that all use the same basic query with some variations in the where clause. The account numbers differ, things like that. Nothing big between any of the locations. And all of these locations work- except one.

For one location, partway through the chain of queries it executes, it hangs. If I pick it apart and run one query at a time, each query completes in a timely fashion. No one query hangs. If I start re-adding queries back in to the process, I find that at 10 (of 15) queries, it works fine. But when I execute eleven or more queries, it hangs. These queries all execute serially, so it's not a threading issue or something like that.

Now, here's the kicker: it doesn't seem to matter which 11 queries I execute. Any combination of 11 seems to blow up. And it's not the 11th query that hangs- it hangs on query number 6 pretty consistently. I have about 10 other locations that do the same exact thing, and they all work fine. It's just this ONE location that this happens to. The hang appears to happen when accessing our Oracle Financials system only, never the SQL Server Data Warehouse we get some of our data from. It looks like it's hanging on the remote system, not my local SQL Server.

Suffice to say, I and everyone I've shared this with is utterly perplexed. If it were something logical, like the last query being the source of the hang, or other locations also being problematic, or one specific query that was always failing, I'd be able to dig in and fix it. These inconsistent and irregular bugs are the absolute worst.

Your bonus mindbending physics: the first evidence that we may really be living in a 2D space.
  • And it's not the 11th query that hangs- it hangs on query number 6 pretty consistently.

    Resign yourself, Number 6. You will never succeed.
  • I remember in SQL 2000 it was possible to execute queries serially, and not quite release the locks from before, thus locking the db (well, the queries anyway. SQL server is pretty robust)

    I'd focus on #6 is SQL Query analyzer - check your indexes, output the contents of the query to a four column table - ID, start, finish, sql_text. Insert the start time, and the text. Update the finish time.

    When it hangs (true hanging is pretty rare... Give it more time... A LOT more time... 30+ minutes for even smallish recordsets of 100,000 rows) - see what is different between that query and when it doesn't hang. Chances are, since it's dynamic sql, you are referencing some non-indexed group of fields that has a large spattering of different data.

    You might also want to run these queries with only one record in the tables. If it's an empty recordset, and it takes a full second or a half-second to run, then you have problems. Tweak the query.

    And in tweaking the query, try all sorts of things that are "bad coding". I've done some crazy stuff that goes against all best practices advice, and it works better. I had to get some code approved by the DBA's at commercehub, and they wouldn't approve it by reading the query. After I demonstrated (which took MONTHS, so annoying) that my stuff actually did work better, then they scratched their heads and approved it.
    • The issue is: query #6 works fine. Query #6 only breaks when there are more than 11 queries in the procedure, and only for one location. The query itself is absurdly simple, along the lines of: SELECT SUM(foo) FROM table WHERE location=X and account#=Y and month = 12 and year = 2008- no joins or anything that complex. Not that many records either.

      When I run just query 6, it comes back quickly with the right result. When I run queries 1-10, query 6 comes back with a result. When I run queries 1-11, or even 1-5,7-12, it hangs at the sixth one executed, so long as the sixth one in the sequence is hitting Oracle Financials.

      I'm trying to figure out how to get a TSql procedure to break things up into different batches, see if that helps. Maybe wrapping up some extra begins and ends or something.
    • Is it possible to reorder it, such that 6 gets executed after 7 or before 5? My thought is then, study the crap out of 5... something there might be screwing things up.

      How do ALL of them run together (sequentially) at once, when it's just pure sql in sql query analyzer? IE, no dynamic sql?
      • If they're not dynamic SQL, they're not running in SQL server. The whole job of my TSQL is to build queries and ship them off to other servers using OPENQUERY.

        I see what you're saying though- I'd have to write some code to get it to output the code that it's running; I might do that, but I think I've found a lead.
        • So it's not the query that's the problem, but BUILDING the query? If that's the case another long shot is an (in)compatible data type. ie, you'd THINK char and varchar should work together, but they're not for some reason... You could look at the schema itself too.

          I know, I know, a billion places to look... but at least you have a lead! I'll be curious what the final answer is... Keep me posted.
    • do *any* queries previous to #6 go to Oracle Financials?

      Is there any chance that there's something really weird going on at a lower level? The last time I saw something that looked similar to this (identical database queries intermittently failing from one location only) , it was solved when the networking department toggled a router setting involving how the router handled fragmented packets.

  • Have you eliminated any hardware issues? We had something like this happen and it turned out that there was some bad memory in the remote box which would intermittently make things puke.
    • It's too consistent for that- it only happens on this location and only when there's a series of queries with 11 or more steps, even when it hasn't executed that far along the path. It's possible, I guess, that it might be that this location ends up pulling enough data to hit that bad block of memory- but it's pulling less data than some and more data than others.
      • Hmmm.... I guess the only way to test that on your end would be to try 11+ queries that don't hit that box at all.

        I know next to nothing about Ora other than that just about anyone I know that works with it spends a lot of time wishing it in to the 11th circle of hell.
  • Never mind my deleted comment, I saw my idea answered in one of the comments. Missed it on first reading.
  • Hmmm... I'm a bit late to the game, and my database experience is (painfully rusty). However, does
    I lookup the bits of the first query, execute it against a remote database, and get a number back. Then I move onto the next query.
    mean that you run a query, get back a recordset, use the data, dispose of the recordset (aka complete the transaction), and then run the next query? How many tables are involved? How many other people are accessing those tables in the database at the same time? Is the likelihood of concurrent updates on the table(s) low, medium, or high?
Powered by LiveJournal.com