OST failed to generate for large ORG

Jacek Nakonieczny reporter

attached ic.log

2016-01-14T09:13:13+00:00

Jacek Nakonieczny reporter

marked as critical

2016-01-14T09:28:48+00:00

Scott Wells repo owner

Jacek, I'll take a look very soon and let you know what I find. I see that you downgraded the severity. Are you able to use the IDE right now?

2016-01-14T15:37:43+00:00

Scott Wells repo owner

Jacek, I see two things here. First, it appears that the size of this org is causing IntelliJ IDEA to get pretty tight on memory:

High memory usage (free 58 of 494 MB) while dumping threads to C:\Users\nakoniej\.IdeaIC15\system\log\threadDumps-20160113-153551-IC-143.1184\20160114-092746\threadDump-20160114-092746.txt

For that you might bump the max heap size of IntelliJ IDEA from the default to 2-3x that to accommodate this org. In case you're not sure how to do that:

https://www.jetbrains.com/idea/help/increasing-memory-heap.html

The second thing I see is that it's taking 20 minutes when trying to query the class identifiers (overlays) for methods:

2016-01-14 09:17:58,718 [63732372]   INFO - loud.symtab.OfflineSymbolTable - Querying class identifiers for Method
...
2016-01-14 09:37:07,504 [64881158]   WARN - .IlluminatedCloudExceptionUtil - Unmarshalling Error: Connection reset

I'm looking at ways I might be able to handle that, but I'm querying these in small batches of 100 at a time. It's almost as if the SOQL query for this information is ignoring the filter in the WHERE clause which is based on the Id field that is guaranteed to be indexed. I may need to follow up with Salesforce on that one, but first let's try two things.

The first thing is what I said above about raising the heap size. If the host JVM is thrashing, that could be slowing down everything enough to make this run much longer than it should from the client side. Perhaps bumping the memory ceiling will just allow it to finish. If not, I have some small tactical changes I've made to that query and the associated logging that might help and minimally should give me more diagnostic info. If it still fails with the higher heap size, let me know and I'll get that to you.

Let me know! Sorry for the issues...

2016-01-14T16:15:55+00:00

Scott Wells repo owner

Jacek, I'm going to make a build of the plugin available to you with these small diagnostic and efficiency changes mentioned above. If you haven't already done so, please wait to try again with a larger heap size until I make this available so that we can get more diagnostic info from the runtime, especially since it takes so long against this org. I should have it ready in the next 15 mins.

2016-01-14T16:28:21+00:00

Scott Wells repo owner

Okay, I have a build available for download here:

https://drive.google.com/file/d/0B2eoK6ID4_FeMHlVQ0NTdE0yTDA/view?usp=sharing

Once you download that, you can install it using Settings>Plugins>Install plugin from disk and then let the IDE restart.

2016-01-14T16:37:41+00:00

Jacek Nakonieczny reporter

Yes i managed to use the tools, ignoring the issue of lack of OST and associated warnings.

Right now I'm after office hours. I think in few hours I'll get hands on IC again and I'll use tips from you.

Best regards Jacek Nakonieczny

2016-01-14T16:45:09+00:00

Scott Wells repo owner

Sounds good, Jacek. Also, when you generate it again with the build linked above, also add the following for diagnostic debugging under Help>Configure Debug Log Settings:

#com.illuminatedcloud.symtab.OfflineSymbolTable

2016-01-14T16:49:41+00:00

Jacek Nakonieczny reporter

attached ic-2.log

Installed new IC version. Increased heap size to 1GB. Enabled logs. Generated OST. 2h later OST was generated. Log for reference.

2016-01-15T10:25:07+00:00

Scott Wells repo owner

Okay. On the one hand, I'm glad it generated; on the other hand, two hours is an exorbitantly long time. I'll check with Salesforce on whether there's any way to make these queries faster. The log you've provided should be useful for that.

2016-01-15T15:39:07+00:00

Scott Wells repo owner

Jacek, I've actually reproduced the original error that you encountered ("Connection reset" due to exhausted heap), bumped the heap up to 1GB, and also changed the batch size when querying these overlays from 100 to 900 (any more would exceed the maximum SOQL query length of 20K characters), and I'm now seeing improved times for OST generation. It's still not ideal for REALLY large numbers of overlays, but I think it's as good as I can get it without help from Salesforce. I'll include the larger batch size in the next release so that if/when you do have to regenerate your OST, hopefully it won't take two hours.

2016-01-15T17:41:06+00:00

Jacek Nakonieczny reporter

I'll regenerate OST after next release just for sake of checking performance changes.

I don'y know how process of generating OST looks like. Can it be mapreduce'd over multiple threads? Or perhaps cached based on object/class modification timestamp? Just a brain dump ideas.

2016-01-15T18:35:02+00:00

Scott Wells repo owner

Jacek, I've considered doing exactly these types of things. The real bottleneck is the querying of overlays (i.e., ApexClassIdentifier rows). I'm already doing things to minimize the data that I need to retrieve, e.g., only including accessible identifiers (global from installed managed packages and protected+ from unpackaged classes), and timestamps wouldn't likely help here because of the way this data is structured.

I don't think a map-reduce approach will likely help either, but I do wonder whether multiple threads might, one per-query of 900 items. I'm going to give that a shot and make sure that I'm not just going to end up thrashing that same table/SObject by issuing multiple concurrent queries against it, though.

Thanks for the input, and I'll let you know when there's a new build that will hopefully make this considerably quicker against these more heavily-populated orgs.

2016-01-15T18:48:24+00:00

Jacek Nakonieczny reporter

You mentioned that you're queries ignores WHERE clause. Does it apply to OFFSET/LIMIT as well? Those might eliminate hitting same data from multiple threads by keeping track which parts has been/are processed.

Another thing could be giving user ability to decide from which classes/objects/packages build OST. That's because in crowded org you don't need all that "trash" data that you don't care about.

How about storing intermediate data? User could decide if he wants to start where previous process stopped/paused/crashed or start fresh.

2016-01-15T20:26:16+00:00

Scott Wells repo owner

The SOQL query also specifies a LIMIT clause with the number of items in the IN clause. There's not really an opportunity for an OFFSET because this isn't really paging as much as querying subsets from a partitioned list of IDs where each partition is the maximum number of IDs that can be used without exceeding the maximum SOQL query length limit.

I've considered giving the user the ability to specify which subset of items from the org to add to the OST, and depending on how much of an improvement I can get from the current work I'm doing, I may still look into that. I'm hoping that a combination of larger batches (raising from 100 to 900 on the main bottleneck) and allowing concurrent queries for independent batches gets things into a good range without having to limit the scope of the OST.

As for storing intermediate data, I actually do that now. If something happens that causes the OST process to terminate early, it leaves JSON files around that it can use to resume from the point of failure without having to rebuild things that already queried successfully.

Good brainstorming! Much appreciated!

2016-01-15T20:40:30+00:00

Jacek Nakonieczny reporter

I doubt that this JSON is used because if yes then i would successfully generate OST after n-th try before i created this issue. Or at least i wouldn't start from scratch every time. So ... :)

Are you telling that you do not bind ID's as variable but incorporate them as integral part of query in string form? Did i get it right?

2016-01-15T20:50:47+00:00

Jacek Nakonieczny reporter

And one more thing. Using OFFSET/LIMIT would (could?) eliminate usage of ID's on generating OST for all org. In that case you could be limited only by number of returned objects.

2016-01-15T20:54:38+00:00

Scott Wells repo owner

I can't use bind variables in these queries because I'm running the queries through the SOAP API. There's not a signature to pass a parameterized query and a list of bind values (to my knowledge, at least). As a result, I have to formulate the full query that's passed into the SOAP partner API's query() method.

The reason I can't use OFFSET here is because I'm not querying all rows in the table, and the WHERE clause is an IN with a set of IDs for the rows that I've already found to be accessible to the Apex you'd write in the IDE. I can only create a SOQL query with an IN clause of ~900 of these IDs before I create a query that surpasses the maximum length, so I have to partition the IDs into batches of 900 and execute a query for each. There's not a way to use a query locator here or to use OFFSET/LIMIT to page because there's not one contiguous result set that can be queried in a single statement.

Hopefully that explanation of why I'm having to do it this way makes sense...

2016-01-15T21:12:42+00:00

Scott Wells repo owner

Oh, and it's worth pointing out that the reason I perform a first-pass filter to determine which IDs to query based on accessibility/visibility is because otherwise the ~14K rows that I have to query in your environment is closer to ~100K rows (or more) because things like private identifiers are included as well. That not only makes these queries take forever, but the unmarshalled SOAP response ends up being huge and can cause OutOfMemoryErrors in IntelliJ's host JVM.

2016-01-15T21:33:23+00:00

Scott Wells repo owner

One more thing. The JSON documents right now are for full steps in OST production, not for batches of steps. So for example, if one run successfully queries system classes, custom classes, and SObjects, then fails querying overlays (this is what happened with you), then the next time you generate it'll be able to pick up the first three from JSON files and will start making server calls with the fourth. That's why repeated runs didn't resolve the issue for you until you bumped up the heap size.

2016-01-15T21:37:49+00:00

Jacek Nakonieczny reporter

That sounds reasonable. You don't need to use query locator to utilize LIMIT/OFFSET because you can calculate them instead of usage of ID's.

AFAIK you cant access objects records and event fields if you don't have access permission set. This was cause of few surprises in usage of Real Force Explorer usage. Parameterized queries are used in RFE so there's a way as mentioned in ~~#188~~ :)

What's about usage of anonymous apex [6794086] marker is 2nd log file? Could you somehow use this?

Those 2 extra comments put me out of zone ... :)

2016-01-15T21:45:15+00:00

Jacek Nakonieczny reporter

Regarding visibility/accessibility could you use first-pass query as a sub-query of 2nd one?

Random thought about query optimization. https://www.youtube.com/watch?v=uyyXdEIieN4

2016-01-15T22:04:24+00:00

Scott Wells repo owner

Unfortunately the two queries are against different resources. One is against the Tooling API and the other is against the Partner API (and an undocumented API at that; that's how I'm able to produce a strongly-typed SDK without ANY or Object all over the place).

As for what RFE is doing, it's simulating bind params, but it's still building the full SOQL query based on the supplied values before issuing the statement via the API. For better or worse, the only available signatures in the SOAP API are query(String queryString), queryAll(String queryString), and queryMore(QueryLocator queryLocator). There's no available signature that inherently allows bind params. I sure wish there were!

I made pretty solid progress over my late lunch today on parallelizing these queries (as well as others used to construct the OST) and am seeing quite decent results from it. When I get to a good point on this work, I'll post a test build if you'd like to give it a shot and see whether/how much it improves OST performance for you.

2016-01-15T22:11:52+00:00

Jacek Nakonieczny reporter

Undocumented API :+1:

Still wondering about memory footprint, intermediate files for smaller chunks.

Glad to hear about the progress. I'll give it a try for sure.

2016-01-15T22:29:43+00:00

Scott Wells repo owner

Yeah, the folks responsible for tooling at Salesforce are aware of the API I'm calling. There's a whole interesting history around that one, but the end result is that I'm able to get much more information about the system and custom classes in an org than are otherwise available through the standard APIs...the types of things that make real-time expression type evaluation against collection types even possible. Without that, completion, reference navigation, etc., would be pretty broken in the IDE.

2016-01-15T22:46:52+00:00

Scott Wells repo owner

Jacek, I worked on this more today and almost have something ready to share (likely via a pre-release/test build initially). Looking at the second log you supplied, the big chunks of time are:

Loading custom Apex classes - ~6 minutes
Loading SObject types - ~40 minutes
Loading Apex class identifiers (overlays) - ~64 minutes to load ~40K in batches of 100

The changes I've made run the queries for each of these (plus a few others that didn't account for much time) in parallel, and also adjust the last one to query in batches of 900.

I've seen significant savings on the first two, in particular loading SObject types (~80-90% reduction in time). The main savings I'm seeing on the last one (which is obviously the bulk of the time) are in the increased batch size (and therefore less queries overall) rather than the parallelization. Hopefully we'll see that go to ~1/9 of its current time. If so, perhaps we could see these go to ~3 minutes, ~4 minutes, and ~7 minutes respectively, taking your runtime from ~2 hours (which is admittedly silly) to ~15-20 minutes (which is still long, but significantly improved).

I'll let you know when a new build is available and how you can turn on parallel processing (I don't have it enabled by default). I'll try to target making that available Monday.

2016-01-17T00:25:32+00:00

June Bischoff

I've also got a project that has a very large number of objects in the org and is failing on OST generation. (After well over an hour, with the org locked the whole time, per Issue ~~#211~~). Following this ticket with interest!

2016-01-18T15:50:28+00:00

Scott Wells repo owner

I'll try to spin a pre-release build with these changes in the next few hours and will post it here. I'll be interested to get feedback on whether/how much it helps.

2016-01-18T16:20:59+00:00

Scott Wells repo owner

attached IlluminatedCloud-14.zip

Okay, here's a pre-release build with the aforementioned optimizations. You can enable parallel processing during OST generation in the connection settings by checking Use Parallel Processing (Beta). I'd ask that you also turn on some additional debug logging by adding the following in Help>Configure Debug Log Settings:

#com.illuminatedcloud.symtab.OfflineSymbolTable

I've tried this on the largest orgs to which I have access, but none of them are nearly as large as the ones you guys seem to be using. Hopefully we'll see significant improvements, but until we give it a try, obviously I can't be certain.

Oh, and please make sure the heap size for IntelliJ is large enough to accommodate this process. Jacek, I know you bumped yours to 1GB. I would suggest each of you bump to ~1.5GB if you can, and the debug logs will give an indication of how much memory is actually required during OST generation. Given the use of multiple threads, it could well be higher than when processing in serial.

Obviously I recommend you fire this off sometime when you're not needing access to the IDE since it may still take a long time to generate.

Please let me know your findings, and even if they're not significantly improved (or if this fails for you), that just gives me more information to keep chiseling away at this!

2016-01-18T17:05:38+00:00

Jacek Nakonieczny reporter

attached ic-3.log

Ok, it was much faster than previous attempt. Time went down from 2h to 15min. In this case it's great performance boost. Good job :+1:

On the other hand there was many connection resets (in the background, but still) and some internal errors.

2016-01-19T08:15:20+00:00

Scott Wells repo owner

Thanks, Jacek. Let me study the log a bit and see what we might try next. Obviously with all of these connection resets, you're not getting the full OST in the end.

I do see that some things like loading custom classes went from ~6 minutes to ~30 seconds with no errors, and loading SObjects went from ~40 minutes to ~2 minutes with no errors. Those two alone are pretty significant gains! What I may do is leave parallel processing of those and disable parallel processing of the class identifiers (that's where all the errors/connection resets occurred) and post a new build. You're still going to get the batch size increase from 100 to 900, so hopefully that gives you an OST with no errors in much less time and therefore a new baseline for generation times from which to evaluate improvement.

2016-01-19T12:12:34+00:00

Scott Wells repo owner

attached IlluminatedCloud-14.zip

New test build with parallel processing enabled for everything but class identifier loading, and those are still loaded in batches of 900 vs. 100 previously. Hopefully this generates a valid OST properly in much lower time.

2016-01-19T12:40:48+00:00

Jacek Nakonieczny reporter

attached ic-4.log

There was noticeable performance degeneration.

2016-01-19T15:21:46+00:00

Scott Wells repo owner

Thanks, Jacek. It looks like this one generated properly, though. The previous one did not. The total runtime to generate this time was ~25 minutes (started at 15:46:46; finished at 16:09:02) compared to ~2 hours previously. Of the long-running steps identified above, you had the following improvements:

Loading custom Apex classes - 27 seconds (previously ~6 minutes)
Loading SObject types - 2 minutes 16 seconds (previously ~40 minutes)
Loading Apex class identifiers (overlays) - 18 minutes (previously ~64 minutes)

This also included a few other bug fixes, though I still see one in that log that I want to address (the StackOverflowError during indexing).

Obviously a 25 minute runtime to generate the OST isn't ideal, but coming down from two hours, I'll consider that progress. I'll get these changes into a full release in the next couple of days. I'll keep looking for opportunities to improve it, but since parallelization of class identifier loading causes connection issues, I'm concerned about how much more I'll be able to squeeze out of it without other enhancements such as selective generation.

June, I'll be interested to know your findings with this build as well when you get a chance.

2016-01-19T15:48:50+00:00

Jacek Nakonieczny reporter

Sure that's great improvement in comparison to 2h, no doubt. Wonder what are mode detailed stats per single batch. Single one takes from 12-25 seconds (Method class identifiers).

Is it heavily depended on SFDC response time? What's analyze/query time ratio? What causes connection reset (too much API calls?)? How about split it to multiple "query" threds and one "alanyzer" thread?

2016-01-19T16:05:38+00:00

Scott Wells repo owner

I actually already have parallelization in two locations when doing that operation: 1) parallel querying of identifier types (class vs. method vs. variable); 2) within each type, parallel querying of 900 item batches. I can enable those discretely if we want to see if one or the other runs without raising the same errors and improves time.

I think part of the issue here is that this is the undocumented API, and it does appear that API may not have the same infrastructure build-out or quality-of-service as the documented APIs. As a result, I haven't seen it scale well for parallel queries like the other ones (custom Apex classes and SObject metadata).

Let me know if you want to try out builds with each of these enabled. I'm happy to provide them see whether they help without introducing failures.

2016-01-19T16:33:41+00:00

Jacek Nakonieczny reporter

Why not. Prepare builds and i can check them. Have you considered gathering anonymous statistics? :)

Best regards Jacek Nakonieczny

2016-01-19T16:50:25+00:00

Scott Wells repo owner

Will do. It'll probably be a few hours until I can post them. I'm on a business trip this week and my availability is pretty spotty. I'll attach two builds here, each with one of these parallelization options enabled. I'll be interested to get the results from each.

And yes, I've considered adding a "Help me help you" anonymous stat option. You know, in general I turn those off in apps that have them myself which is why I've shied away, but it'd very likely be valuable. If I add that, I'll definitely want to make it obvious to the user and easy to disable.

2016-01-19T18:01:47+00:00

Scott Wells repo owner

attached IlluminatedCloud-14-ParallelTypes.zip

This build allow parallel processing of identifier types but queries for each type are still performed in serial batches of 900 rows. This will result in a fixed number of parallel threads and will run as long as the longest-running identifier type (likely method or variable).

2016-01-19T19:24:33+00:00

Scott Wells repo owner

attached IlluminatedCloud-14-ParallelPerType.zip

This build allow parallel processing of batches within each identifier type but types are still processed in serial order. This will result in a variable number of parallel threads based on how many 900 item batches are required.

2016-01-19T19:26:47+00:00

Scott Wells repo owner

Jacek, new builds are attached. Let me know how things go with them if you get a chance to try them. Logs in particular are useful for comparing runtime and stability. Thanks!

2016-01-19T19:31:31+00:00

Jacek Nakonieczny reporter

Great, let me run OST. Results should be in no time.

2016-01-19T20:03:22+00:00

Scott Wells repo owner

"Results should be in no time". That alone is great to hear given where we started!

2016-01-19T20:29:20+00:00

Scott Wells repo owner

Jacek sent logs out-of-band and it looks like both of these attempts to parallelize the class identifier step result in connection reset errors while calling the API. Additionally the times didn't really improve much even trying to run things in parallel. I'm going to get the changes that did work (attached here as IlluminatedCloud-14.zip) ready for release so that everyone can have access to them. I'll also continue to work with Salesforce on the Tooling API gaps that are being addressed by this undocumented API so that ideally they can be avoided in favor of the much higher performance official APIs. I think that's where the real gains will be made at this point.

2016-01-19T22:59:42+00:00

Scott Wells repo owner

changed version to 1.6.8.1

2016-01-20T14:37:56+00:00

June Bischoff

attached idea.log

Finally got a chance to install IC14, and my OST generated! It still took something like 2-1/2 - 3 hours, but it appears at a very first glance to have been successful. Log is attached.

2016-01-20T14:38:13+00:00

Scott Wells repo owner

Thanks, June. I'll scan the logs in a bit and see what I can learn. Did you enable parallel processing for the OST generation?

2016-01-20T14:41:53+00:00

June Bischoff

I don't believe I did; I just did a right click to generate the OST.

2016-01-20T14:42:51+00:00

Scott Wells repo owner

June, it doesn't look like yours executed with the change I made to improve OST generation times. In particular the bulk of your OST generation time is querying custom classes which is one of the areas of greatest improvement (Jacek's went from 6 mins to 27 secs), so I'm hoping to see your time come WAY down.

Did you install the build attached to this issue? For what it's worth, I'm planning to release this build with a few other small fixes today, so if you just want to wait for 1.6.8.1 and try with the official build, that's fine.

2016-01-20T14:50:17+00:00

June Bischoff

I thought I did, but it was definitely in the brain-scattered portion of the evening! I'll wait for the official build and if you think it'll be helpful I'll attach a log from my next OST regen to this ticket.

2016-01-20T14:52:28+00:00

Scott Wells repo owner

changed status to resolved

Delivered in 1.6.8.1.

2016-01-20T16:19:51+00:00

Scott Wells repo owner

June, I've released 1.6.8.1. After the update, make sure to check the Use Parallel Processing (Beta) checkbox for the connection and then generate the OST. I'll definitely want to see the log from the generation run. Hopefully it goes quite quickly for you!

2016-01-20T16:22:34+00:00

June Bischoff

Thank you! I'm going to wait until after 5pm to kick off the OST regen, in case we hit the org lock again.

2016-01-20T16:25:21+00:00

Scott Wells repo owner

June, did you get a chance to generate the OST? No hurry at all, of course, but curious to know results. I just heard from another set of users who were seeing ~20 min generation times before and are seeing ~2 min generation times now, so definitely encouraging!

2016-01-21T15:27:53+00:00

June Bischoff

17 MINUTES!!!!!! Woohoo! BIG improvement over 2-3 hours.

2016-01-21T16:53:39+00:00

Scott Wells repo owner

That's wonderful to hear, June! Thanks for following up. Would you mind sharing idea.log so I could review some metrics?

2016-01-21T17:16:31+00:00

June Bischoff

attached idea.log

Heh. That was the point, wasn't it. I'm not at all distracted, no.

2016-01-21T17:22:44+00:00

Scott Wells repo owner

Thanks, June. So it looks like you saw the biggest wins from loading custom classes which now executes in ~25 secs and loading SObjects/docs which now executes in about a minute. The latter was definitely your big bottleneck before because it was performing ~2.3K sequential queries, and now those are parallelized. The bulk of the time is, as expected, loading the Apex class identifiers that yield a strongly-typed OST, accounting for about 14.5 minutes. Unfortunately I'm not sure I can get that down much more without help from Salesforce, but I'll continue looking for creative options.

Nonetheless, 2-3 hours down to 17 minutes is a great data point!

Thanks again for the info...

2016-01-21T18:27:26+00:00

Comments (58)