Open edX devstack and Ocim sandbox issues

Yep, we should do that, I just ran out of time yesterday :+1:

The process is simply to report it as a CRI issue (and ping Ned). See CRI-206 for an example (reported via SE-2587).

Any knowledge we might already have gained about the issue should be included in the ticket description.

@daniel If you could take care of this as SF, that would be great :slightly_smiling_face:


I just created CRI-215 in JIRA and pinged Ned. I’ll continue at Log in - OpenCraft


I found a workaround for this issue and have updated Ocim’s watched fork configuration for new PR sandboxes to add the following:

# Update 2020-06-23, issues with certs package using pip2
# SE-2810, CRI-215
CERTS_VERSION: open-release/juniper.1
certs_version: open-release/juniper.1

I went to update and redeploy the failed PR sandboxes we have on Ocim and found there are 22(!) of them that have never had a running appserver, so I didn’t bother. This means that we either don’t need these sandboxes to demonstrate the PR, or that we haven’t finished prepping our sandboxes for our OSPRs?

In addition to adding these settings, I also had to use the upstream version of the configuration repo to get my PR sandbox to build. Should we replace the open-craft/configuration fork with the upstream master version of configuration in the watched fork configuration, or should we update our fork instead?

Good to know, thanks @mtyaka! I’ve updated the watched fork to use upstream master… it was really flaky for a while, so if it becomes that way again, we can update our branch and go back to using that.

@serenity @bebop A quick heads-up that the process for reporting and following up on periodic build failures has now been formalized in the handbook. See the updates to the ops reviewer and firefighter roles from https://gitlab.com/opencraft/documentation/public/-/merge_requests/174 for details.


Periodic builds for master are currently failing with the following error:

TASK [forum : initialize elasticsearch] *****************************************************************************************************************************************
fatal: [149.202.187.64]: FAILED! => {"changed": true, "cmd": ["/edx/app/forum/cs_comments_service/bin/rake", "search:initialize"], "delta": "0:00:04.117051", "end": "2020-10-25 21:57:34.776740", "msg": "non-zero return code", "rc": 1, "start": "2020-10-25 21:57:30.659689", "stderr": "/edx/app/forum/cs_comments_service/lib/tasks/deep_search.rake:7: warning: already initialized constant ROOT\n/edx/app/forum/cs_comments_service/lib/tasks/kpis.rake:7: warning: previous definition of ROOT was here\n/edx/app/forum/cs_comments_service/models/constants.rb:2: warning: already initialized constant COURSE_ID\n/edx/app/forum/cs_comments_service/lib/tasks/db.rake:28: warning: previous definition of COURSE_ID was here\n/edx/app/forum/cs_comments_service/lib/tasks/flags.rake:6: warning: already initialized constant ROOT\n/edx/app/forum/cs_comments_service/lib/tasks/deep_search.rake:7: warning: previous definition of ROOT was here\nrake aborted!\nElasticsearch::Transport::Transport::Errors::InternalServerError: [500] {\"error\":\"ClassCastException[java.lang.String cannot be cast to java.util.Map]\",\"status\":500}\n/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/transport/base.rb:218:in `__raise_transport_error'\n/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/transport/base.rb:346:in `perform_request'\n/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/transport/http/faraday.rb:37:in `perform_request'\n/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/client.rb:176:in `perform_request'\n/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-api-7.8.0/lib/elasticsearch/api/namespace/common.rb:38:in `perform_request'\n/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-api-7.8.0/lib/elasticsearch/api/actions/indices/create.rb:48:in `create'\n/edx/app/forum/cs_comments_service/lib/task_helpers.rb:92:in `block in 
create_indices'\n/edx/app/forum/cs_comments_service/lib/task_helpers.rb:89:in `each'\n/edx/app/forum/cs_comments_service/lib/task_helpers.rb:89:in `create_indices'\n/edx/app/forum/cs_comments_service/lib/task_helpers.rb:198:in `initialize_indices'\n/edx/app/forum/cs_comments_service/lib/tasks/search.rake:30:in `block (2 levels) in <top (required)>'\n/edx/app/forum/.gem/ruby/2.5.0/gems/rake-12.0.0/exe/rake:27:in `<top (required)>'\nTasks: TOP => search:initialize\n(See full trace by running task with --trace)", "stderr_lines": ["/edx/app/forum/cs_comments_service/lib/tasks/deep_search.rake:7: warning: already initialized constant ROOT", "/edx/app/forum/cs_comments_service/lib/tasks/kpis.rake:7: warning: previous definition of ROOT was here", "/edx/app/forum/cs_comments_service/models/constants.rb:2: warning: already initialized constant COURSE_ID", "/edx/app/forum/cs_comments_service/lib/tasks/db.rake:28: warning: previous definition of COURSE_ID was here", "/edx/app/forum/cs_comments_service/lib/tasks/flags.rake:6: warning: already initialized constant ROOT", "/edx/app/forum/cs_comments_service/lib/tasks/deep_search.rake:7: warning: previous definition of ROOT was here", "rake aborted!", "Elasticsearch::Transport::Transport::Errors::InternalServerError: [500] {\"error\":\"ClassCastException[java.lang.String cannot be cast to java.util.Map]\",\"status\":500}", "/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/transport/base.rb:218:in `__raise_transport_error'", "/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/transport/base.rb:346:in `perform_request'", "/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/transport/http/faraday.rb:37:in `perform_request'", "/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-transport-7.8.0/lib/elasticsearch/transport/client.rb:176:in `perform_request'", 
"/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-api-7.8.0/lib/elasticsearch/api/namespace/common.rb:38:in `perform_request'", "/edx/app/forum/.gem/ruby/2.5.0/gems/elasticsearch-api-7.8.0/lib/elasticsearch/api/actions/indices/create.rb:48:in `create'", "/edx/app/forum/cs_comments_service/lib/task_helpers.rb:92:in `block in create_indices'", "/edx/app/forum/cs_comments_service/lib/task_helpers.rb:89:in `each'", "/edx/app/forum/cs_comments_service/lib/task_helpers.rb:89:in `create_indices'", "/edx/app/forum/cs_comments_service/lib/task_helpers.rb:198:in `initialize_indices'", "/edx/app/forum/cs_comments_service/lib/tasks/search.rake:30:in `block (2 levels) in <top (required)>'", "/edx/app/forum/.gem/ruby/2.5.0/gems/rake-12.0.0/exe/rake:27:in `<top (required)>'", "Tasks: TOP => search:initialize", "(See full trace by running task with --trace)"], "stdout": "W, [2020-10-25T21:57:34.341082 #28234]  WARN -- : Overwriting existing field _id in class User.\nW, [2020-10-25T21:57:34.390196 #28234]  WARN -- : MONGODB | Unsupported client option 'max_retries'. It will be ignored.\nW, [2020-10-25T21:57:34.390303 #28234]  WARN -- : MONGODB | Unsupported client option 'retry_interval'. It will be ignored.\nW, [2020-10-25T21:57:34.390326 #28234]  WARN -- : MONGODB | Unsupported client option 'timeout'. It will be ignored.", "stdout_lines": ["W, [2020-10-25T21:57:34.341082 #28234]  WARN -- : Overwriting existing field _id in class User.", "W, [2020-10-25T21:57:34.390196 #28234]  WARN -- : MONGODB | Unsupported client option 'max_retries'. It will be ignored.", "W, [2020-10-25T21:57:34.390303 #28234]  WARN -- : MONGODB | Unsupported client option 'retry_interval'. It will be ignored.", "W, [2020-10-25T21:57:34.390326 #28234]  WARN -- : MONGODB | Unsupported client option 'timeout'. It will be ignored."]}
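The task result above is hard to scan because the Ruby warnings and stack frames drown out the actual failure. A minimal sketch of one way to pull out the relevant lines, assuming the result has been loaded into a Python dict with the same `stderr_lines` key that Ansible prints: it simply drops every line that starts with a file path. The function name `root_error_lines` and the abbreviated `sample_result` are illustrative, not part of any existing tooling.

```python
def root_error_lines(result):
    """Return the stderr lines that do not start with a file path.

    In the log above, both the Ruby warnings and the stack frames begin
    with an absolute path, so what remains is the error summary.
    """
    return [
        line
        for line in result.get("stderr_lines", [])
        if line and not line.startswith("/")
    ]


# Abbreviated copy of the task result above; only a few stderr lines are kept.
sample_result = {
    "rc": 1,
    "stderr_lines": [
        "/edx/app/forum/cs_comments_service/lib/tasks/deep_search.rake:7: warning: already initialized constant ROOT",
        "rake aborted!",
        'Elasticsearch::Transport::Transport::Errors::InternalServerError: [500] {"error":"ClassCastException[java.lang.String cannot be cast to java.util.Map]","status":500}',
        "/edx/app/forum/cs_comments_service/lib/task_helpers.rb:92:in `block in create_indices'",
        "Tasks: TOP => search:initialize",
    ],
}

for line in root_error_lines(sample_result):
    print(line)
```

For this failure, the filter leaves the `rake aborted!` line, the Elasticsearch `InternalServerError` with its `ClassCastException` message, and the `Tasks: TOP => search:initialize` line, which is probably the part worth quoting in a CRI ticket.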

@Agrendalath @pooja @demid @usman Could you please follow up on this as SFs for the current Sprint 232 (following the process mentioned above)?

@tikr, unfortunately I don’t have too much free FF capacity at the moment. I can take a look at this on Friday if nobody else has time before then.


@tikr Since we now have a public ML where the build logs are announced, would it make sense to post messages like this there, with the firefighters in CC, so that people outside of OpenCraft could know about the detected (and manually confirmed) breakage? I was discussing this with Ned during last week’s contributors meetup, and he mentioned he would need help confirming the build errors that arrive there, and help for him (and other people in the community) in understanding which errors to investigate. This way it doesn’t have to be just us investigating, and other community members could also react if they see a similar issue on their side.


@tikr I suspect it has to do with: https://github.com/edx/cs_comments_service/pull/327. I’ll investigate it further tomorrow.


@antoviaque Sure, posting these updates to the public ML sounds good :slight_smile: How can I join it?

A few additional notes/questions:

  • To make sure that everyone on the team is in the loop about these breakages, would it make sense to still post them here as well (in the form of a link to the corresponding ML posts, perhaps)?
  • Re: helping Ned confirm build errors and understand which errors to investigate: What info would be good to include in the ML posts to address this? (The Trello card that you linked to didn’t mention this and I didn’t get a chance to watch the recording of the meeting this week.)
  • Once we’re clear on the items above, I’ll have to adjust the changes from https://gitlab.com/opencraft/documentation/public/-/merge_requests/174 so that they match the new process.

Thanks @usman for investigating this one :raised_hands:

When there’s a task for this incident please post it here…

@tikr See:

Sure - maybe just link to the public place internally then, asking to comment/collaborate with everyone on the public thread rather than just between us internally.

The best would likely be to ask Ned, though there is already a PR to get a lot of that info directly into the automated email. You can probably start with what you would want to have access to in order to debug/reproduce the issue.


A note that there is a PR being worked on by Diana to resolve the issue which is causing the periodic master builds to fail: https://github.com/edx/configuration/pull/6093.

Tracking this internally in SE-3586.

A new error in periodic builds: https://manage.opencraft.com/instance/17120/edx-appserver/15241/

2020-11-09 08:36:39+0200 INFO TASK [nginx : Write out htpasswd file] *********************************************
2020-11-09 08:36:39+0200 INFO An exception occurred during task execution. To see the full traceback, use -vvv. The error was: ImportError: No module named 'passlib'
2020-11-09 08:36:39+0200 ERROR away and you might need to add |bool to the expression in the future. Also see

Passlib is mentioned somewhere in our playbooks, and it may be where we set the HTTP basic auth password for the LB←→edxapp communication.
Can the firefighters take a look and create a task if needed? @mtyaka @guruprasad @jill @giovannicimolin
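The `ImportError` suggests that the Python interpreter Ansible runs on the appserver cannot import `passlib`, which Ansible's `htpasswd` module lists as a requirement. A minimal sketch for confirming that on the affected host, assuming you can run it with the same interpreter Ansible uses there (which interpreter that is has not been established in this thread); the helper name `missing_modules` is made up for illustration.

```python
import importlib.util
import sys


def missing_modules(names):
    """Return the top-level module names the current interpreter cannot find."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(["passlib"])
    # Print the interpreter path too, since Python 2 and Python 3 on the
    # same host can have different packages installed.
    print(sys.executable, "is missing:", missing or "nothing")
```

If `passlib` shows up as missing for one interpreter but not another, that would point at the playbook installing it for the wrong Python version rather than at the nginx role itself.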

@daniel Thanks! Could you do this though:

We should work on those issues with the community and edX, i.e. publicly, so we can eventually share some of that maintenance burden with them.


This issue may not be in Open edX but in our playbooks. I’m not sure and can’t debug it right now, so I’ll let a firefighter decide it and create a task upstream (maybe a CRI- task… or a forum post…).

Yep, I’ll take a look.


If that’s the case, simply mention that it might be an OpenCraft-specific issue. It doesn’t remove the need to work publicly on this.


To all – please don’t post any more messages or discuss directly in this thread - instead, use the public mailing list, replying on the thread of the failure report if you’re posting about a specific failure, and only link to important messages here for awareness.



Understood.

There isn’t a message in the mailing list yet for this periodic master build failure because according to the logs, it was initiated by an OpenCraft team member? So I’ve temporarily modified the periodic build instance’s build frequency to automatically trigger another build soon, and will comment on the mailing list when the failure is posted there.

Ticket is SE-3613.
