
Emergency Network Maintenance - North America


SERVICE ALERT 9:31AM JANUARY 25th, 2019

 

 

 

DETAILS AS FOLLOWS

===========

 

Alert ID: 538365242363609088

 


ShadowICT Engineering has discovered a fault in the primary Canadian network that is causing unexpected restarts. A thorough investigation into the matter is required. ShadowICT Engineering has escalated this incident to the Network Operations Centre, which will intervene on 2019-01-25 11:00:00 EST (UTC -05:00).
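For anyone outside the Eastern time zone, the intervention start can be converted to UTC. The snippet below is only a minimal sketch: it treats EST as the fixed UTC -05:00 offset stated in the alert, and the function name `intervention_start_utc` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# EST modelled as a fixed UTC-05:00 offset, exactly as stated in the alert.
EST = timezone(timedelta(hours=-5), "EST")


def intervention_start_utc() -> datetime:
    """Return the announced intervention start converted to UTC."""
    local_start = datetime(2019, 1, 25, 11, 0, 0, tzinfo=EST)
    return local_start.astimezone(timezone.utc)


if __name__ == "__main__":
    # Prints 2019-01-25T16:00:00+00:00
    print(intervention_start_utc().isoformat())
```

In other words, 11:00 EST corresponds to 16:00 UTC on the same day.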

 

Affected Servers

===========

1. WIN-GSM.SHADOWICT.TECH

 

2. NEBULA-CHARGED.SHADOWICT.TECH

 

3. GRID.SHADOWICT.TECH

 

4. ROUTING.NETWORK.SHADOWICT.TECH

 

Affected Services

===========

1. Garry's Mod - MilitaryRP

 

2. Team Fortress 2 - MvM #1

 

3. Team Fortress 2 - MvM #2

 

4. Team Fortress 2 - MvM #3

 

5. STAFF COMMUNICATION

 

6. STAFF ROUTING

 

7. ENGINEERING DIVISION - VIRTUAL LOCAL AREA NETWORK

 

8. DATA BACKEND (COMMUNITY/GSO/DISCORD/API)

 

Intervention Details

===========

1. Assign to Level 3 Network Engineer - ID# 151896103850082304 (SICT CORPORATE: @Dom)

2. Shut down physical hypervisor

3. Remove hypervisor from server rack

4. Perform extensive hardware test

5. Replace faulty hardware (if any) and restore services.

 

Alert Cleared

===========

Started at 11:00AM, January 25th, 2019 by 220482106449461248 (ShadowICT Automation)

 

Cleared at n/a


It's sad it's down.


The North American network has recovered and is being closely monitored for further complications.

 

Extensive diagnostics revealed no clear cause for the restarts. Details of the diagnostics are below.

 


Launching a CPU/RAM stress test and taking a look at the temperature 
All the components are properly detected. 
The average temperature of the CPU while running the test was stable at ~46°C-49°C.
The rack did not freeze/reboot during the test.
No errors were found in the BIOS event log. 
Server came back up on disk 
ping ok 
IPMI UP 

 

The service alert has been de-escalated from the NOC to level three technicians and has transitioned to Advisory status.
