From: Stephen L Johnson
Date: Thu, 18 Jun 1998 15:58:57 +0000 (+0000)
Subject: Initial import
X-Git-Tag: start~47
X-Git-Url: http://git.etc.gen.nz/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=0d43ba37273909ab528e97ce040879521eb95d71;p=spong.git

Initial import
---

diff --git a/changes-2.1 b/changes-2.1
new file mode 100755
index 0000000..b4ee214
--- /dev/null
+++ b/changes-2.1
@@ -0,0 +1,375 @@
+
+I'm sure the Gold App Dev Request form for all of this is just delayed
+in Campus Mail...
+
+This email note is an epic. Ed in the role of the protagonist, Doug in
+the role of the antagonist (of course). Save this note, it's the only
+documentation you will get for a while...
+
+
+Here is a summary of the changes that I made to spong. The new version
+of Spong was put into production this morning. I came down and walked
+the day shift through the new features. I'll leave a note for the 2nd
+and 3rd shift. The changes that I made should not significantly affect
+how they use it.
+
+But anyway - first the single, lone, isolated, only, request that I
+didn't do.
+
+> --------------------------------------------------
+> (*) Change "spong" name to "pong" or something doesn't get confused with
+>     "sponge",
+> --------------------------------------------------
+
+Changing the name would just cause additional confusion, and requires
+more effort than it is worth IMNSHO. Feel free to make fun of me for
+being attached to the name if it makes you feel better...
+
+
+> --------------------------------------------------
+> (*) Change "procs" name to "jobs" to lessen confusion about what is down,
+> --------------------------------------------------
+
+Done. The real fix, however, will require the following:
+
+  1. You will need to push out a new copy of spong-client to all of the
+     machines and restart spong-client.
+
+  2. Once you have that done everywhere, you will need to clean out the
+     spong database on dim, so all records of the old procs service are
+     removed.
+
+But since I know how much you will whine if I try to make you do that,
+I also went ahead and added a few little hacks in the spong-scripts that
+check for "procs" in various places and replace the string with "jobs"
+when they find it. So you can do steps 1 and 2 above at your leisure...
+
+
+> --------------------------------------------------
+> (*) Improve flexibility of setting down times for various services,
+> --------------------------------------------------
+
+Basically, I just fixed the way it was supposed to work before. In the
+spong.hosts file you can configure a host like the following:
+
+    'strobe.weeg.uiowa.edu' => { services => 'ftp pop3 http',
+                                 contact  => 'unix-staff',
+                                 group    => 'unix-all',
+                                 down     => [ '*:12:00-14:00',
+                                               '1:4:00-6:00' ] };
+
+This definition says that strobe is down every day from noon until 2pm,
+and is also down on Monday morning from 4-6am. You can include as many
+different downtimes as you want. The first part of the field is the day
+of the week (0=sunday), or * indicating that it happens every day. The
+second part is the time range in 24 hour format.
+
+This wasn't working in the version of spong that you are using, so I
+have fixed it. Basically during the times that you define, all of the
+services on the machine are "acknowledged" - well not actually, what
+really happens is that their services just have a "blue" state, so that
+they are not reported as problems to the operators. I could make them
+have a unique color if you want, but I think 5 colors is plenty...
+
+Now, this isn't a full-blown cron type thing, but when working I think it
+will handle most of the cases. It doesn't handle the silicon case where
+just ftp is down at certain times, but that is addressed later...
+
+
+> --------------------------------------------------
+> (*) Include option (default?) that just shows "down" services,
+> --------------------------------------------------
+
+I believe that the main reason that you wanted to do this is just
+because it was so damn slow. Hopefully I've fixed that problem so this
+is less of an issue.
+
+If you still want to see just the hosts with problems, basically you
+just need to jump to the URL that loads the left frame. That URL is:
+
+    http://dim.weeg.uiowa.edu/cgi-bin/www-spong/problems/all
+
+
+> --------------------------------------------------
+> (*) Provide clarity in messages so operators page with exact phrasing,
+> --------------------------------------------------
+
+I've done the following. The Problem List frame has the name of the
+group/person to contact, and the Host and Service page has a link that
+says "Contact Staff". Both of those links are now "smart", meaning
+that when you click on one, it will take you to the Operator Paging
+Page, with the name of the person who is on call for that machine
+already selected and a message indicating the problem already filled
+out in the message box. So in most cases the operator just has to
+press the send button.
+
+This means that there is an additional file you have to maintain with
+the paging cgi script. There is a file called "hosts" in the directory
+with the rest of the paging stuff. That file contains a list of hosts,
+and the group or individual that is responsible for each machine.
+
+The default message contains the host name, and if there is a single
+problem it lists the service name and summary information (which is
+sometimes redundant). If there are multiple problems then it will just
+list the services that are red.
+
+Again, the operators will still have the ability to change the message
+if they want.
+
+
+> --------------------------------------------------
+> Improve loading performance.
+> --------------------------------------------------
+
+Well... Here is what I've done. I believe it's significantly faster
+from what some of you have said (but then again I thought the old one
+was just fine).
+
+I got rid of the little dot images, and replaced them with little squares
+that are just tables with their background set to different colors.
+This seems to get around the problem that Netscape has displaying all
+the images. So once the page gets to your browser, it should display
+much faster.
+
+If some day Netscape gets fixed and you are feeling nostalgic for the
+dots, then you can turn them back on in the spong.conf file.
+
+    $WWW_USE_IMAGES = 1;
+
+Also, the exact color of the various boxes can be adjusted as well. I
+have played with a number of variations, and I do think what I have is
+pretty good, but if you want to change it, then change the following in
+the spong.conf file:
+
+    $WWW_COLOR{"red"}    = "#cc0000";
+    $WWW_COLOR{"yellow"} = "#ffff00";
+    $WWW_COLOR{"green"}  = "#339900";
+    $WWW_COLOR{"purple"} = "#990099";
+    $WWW_COLOR{"blue"}   = "#0000ff";
+
+> --------------------------------------------------
+> Allow others (e.g., Help Desk) to access spong. This will require
+> access control (e.g., passwords) as well as configurable "refresh rate"
+> option so that only operators and us are allowed auto refresh.
+> --------------------------------------------------
+
+Ok, I'll meet you half way on this one. The access control should come
+through the standard web server mechanism. There are 3 parts to spong:
+
+    www-spong     - which just allows you to view the spong database
+    www-spong-ack - which allows you to update the database
+    page.cgi      - which allows the operators to page us
+
+I would suggest that you basically open up the www-spong program to the
+people that need it, and tighten down the ack and page programs so that
+you don't have the help desk acking problems or paging you.
+
+The way that you want to set up this access is up to you (host-based vs
+user names and passwords, etc...) You are quite familiar with the issues
+of maintaining the password files for this type of thing, etc...
+
+Ok, now that I've passed the buck on that, here is what I'll do. I've
+included two new variables in the spong.conf file. They are:
+
+    @WWW_REFRESH_ALLOW = ( '.*' );
+    @WWW_REFRESH_DENY  = ( 'edhill', '128.255.51\.\d+', 'traitor.*' );
+
+These lists contain regular expressions that are checked. ALLOW
+is checked first, followed by DENY. The following pieces of information
+are checked against these regexps - REMOTE_USER (the username if you
+protect www-spong using user authentication), REMOTE_HOST (the hostname
+of the person connecting), and REMOTE_ADDR (the IP address of the person
+connecting).
+
+If no regular expression is matched from either list, then the
+auto-refresh is not included in the output.
+
+----------------------------------------------------------------------------
+
+And over the course of trying to do these things, instead of hearing the
+kind words of encouragement that I'm used to over in this building, I
+instead heard the following additional whining and tried to address it.
+These comments have not been through Rex's tone modifier like the
+requests above.
+
+> --------------------------------------------------
+> I don't know why we use this program anyway, the spong-server seems to
+> crash for no good reason at random times. We try to hack in things
+> like getting spong to restart spong child when they die, but if think
+> was just written well to begin with, we wouldn't have to do that crap.
+> --------------------------------------------------
+
+I have found a repeatable case where I could get spong to crash. If a
+client (either updating or querying spong) would time out (because spong
+was too slow) - that client would just shut down the connection.
+
+Well, the spong-server was still trying to write to a pipe that was now
+closed. spong-server was told this with a SIGPIPE, but I wasn't
+handling that signal, and if you don't handle that signal and you
+get it, the program exits (terminating is the default action).
+
+So I'm now handling it, and spong-server no longer seems to go away. I've
+left your child restarting code just in case (but as you probably know -
+that won't help the case where the parent dies - either could happen
+with the case that I fixed.)
+
+There could of course be other problems than this, but there is now one
+less thing that will make spong die.
+
+
+> --------------------------------------------------
+> The history feature is useless, I can't believe it's actually slower
+> then the spong summary. It's so slow, it never even comes back before
+> netscape times it out. It's so slow, I went off and wrote my own
+> little spong-history command because your a bad programmer and blah
+> blah blah...
+> --------------------------------------------------
+
+Ok, I wrote a script called spong-cleanup. It should be run every
+night. It does the following:
+
+  * Cleans out any history older than 7 days. It moves the old history
+    for each host into the /local/www/docs/spong/archive directory.
+    If you don't think you would ever want to get at that history, then
+    you can just change the script so that it is deleted.
+
+  * Removes any acknowledgments that are no longer valid.
+
+  * Removes any services that don't seem to be reported any more (if
+    you stop monitoring something on a machine - the old entry will
+    still hang around and show up as purple).
+
+Cleaning out the old history brought the load time down from 90 seconds
+to about 4 seconds. That combined with the colored squares instead of
+gif images should make retrieving the history via the web usable again.
+
+
+On another note, I actually have a command line program called "spong"
+which I have now included in the spong depot package. This is a
+client/server program that basically reports all the same things that
+the www-spong program does (including history). So you could install
+"spong" on your desktop and run:
+
+    spong --history
+
+or if you just want to see the Unix machines:
+
+    spong --history unix-all
+
+It also allows you to view the summary table, problem hosts, individual
+hosts, etc... Type:
+
+    spong --help
+
+So if you wanted to do some type of automatic monitoring of the
+automatic monitoring system from your desktop, you can use the command
+line program to do it.
+
+
+> --------------------------------------------------
+> The acknowledgment mechanism sucks, we want web-ARS. Hey, tie spong
+> into the directory server. Hey, why you're at it, build us a directory
+> server.
+> --------------------------------------------------
+
+Ok, you now have the ability to delete existing acknowledgments via the
+web interface. When you click on a host, in the Acknowledgment section
+next to the descriptions there is a new link which allows you to delete
+an Ack.
+
+If you click on just the generic "Ack" menu item, it will take you to
+the same screen as before (allowing you to add a new acknowledgment),
+but at the top of the screen is a list of all of the pending Acks, so
+you can click on them and delete them.
+
+Updating an acknowledgment can be done through this interface now. 1)
+first you delete the old one, and 2) then you add a new one that is
+similar but different than the old one - Voila, updates 8-) (I ran out
+of time...)
+
+I'll assume that the interface to all of this is self-explanatory.
+
+Ok, now to the lame solution to the "ftp is down on silicon" problem.
+As with the spong command line program, there is also a spong-ack
+client/server command line program. In your script on silicon that
+disables ftp, you could add the following command:
+
+    spong-ack silicon.weeg.uiowa.edu ftp '+18h' 'its all Taos baby'
+
+You can of course do this in any place that you have a script which is
+going to down a service for a period of time.
+
+You can also delete an acknowledgment through this command line
+interface, but it is a little more convoluted. You would run the
+following command:
+
+    spong-ack --delete silicon.weeg.uiowa.edu-ftp-898185233
+
+The little funky looking thing after the delete is the ack id. No, you
+probably don't normally know what the ID of your acknowledgment is, but
+you can find it with the following:
+
+    spong --brief --acks
+
+Yeah, deleting is not the cleanest via the command line, but you get
+what you pay for (and no, there is no "updating" via the command line,
+because in reality there is no updating period - it's all an illusion).
+
+
+> --------------------------------------------------
+> Well what else is broken with this piece of crap package...
+> --------------------------------------------------
+
+Here is one I had not heard of, but noticed this week. If you had a
+host which had a problem with, say, its disk, and you acknowledged that
+service, but then the host had another problem with, say, jobs, the host
+would not show up in the "Problems list" on the left side of the frame,
+so I'm sure the operators would probably not call you about the second
+problem.
+
+I fixed that bug...
+
+
+> --------------------------------------------------
+> Well, I bet you screwed up all the patches that we have had to make to
+> duct tape this software together since you last worked on it a long
+> long time ago.
+> --------------------------------------------------
+
+Well, I incorporated all the changes that I could find, including the
+following:
+
+  * The paging space addition to spong-client
+  * Dan's connection to the LDAP server for machine info
+  * Dave's various fixes for things like PID file removal, etc...
+
+Basically I incorporated everything that I could find from the History
+file or by diffing the various programs.
+
+
+> --------------------------------------------------
+> Why would you get rid of frames you dufus, it was the only thing you
+> did right.
+> --------------------------------------------------
+
+Jeeze, I was just trying to make things faster. Relax, it's back to the
+frames version...
+
+
+> --------------------------------------------------
+> Why do you need root access on dim? How about we just give you access
+> to the cp and ls commands. All other commands that you want to execute
+> will need to be placed in a file called "duh" and submitted for our
+> approval.
+> --------------------------------------------------
+
+My initial response: Grrr...
+
+My response after accidentally chowning /tmp: Yes sir, whatever is best
+sir...
+
+
+If there are any problems with any of this, I'm sure you will let me
+know.
+
+
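
The `down` entries described in the note (a day of the week, with 0 meaning Sunday or `*` meaning every day, followed by a 24-hour time range) can be illustrated with a small sketch. It is written in Python for illustration only and is not taken from spong, which is written in Perl. The function names `parse_window` and `is_down` are hypothetical, and treating the end of the range as exclusive is an assumption, since the note does not say.

```python
from datetime import datetime


def parse_window(spec):
    """Split a spec such as '1:4:00-6:00' into (day, start, end).

    day is None for '*' (every day), otherwise 0-6 with 0 = Sunday.
    start and end are minutes since midnight.
    """
    day, times = spec.split(":", 1)
    start, end = times.split("-")

    def to_minutes(hhmm):
        hours, minutes = hhmm.split(":")
        return int(hours) * 60 + int(minutes)

    return (None if day == "*" else int(day), to_minutes(start), to_minutes(end))


def is_down(specs, when):
    """Return True if `when` falls inside any of the downtime specs.

    Assumption: the end of each range is exclusive.
    """
    weekday = when.isoweekday() % 7  # isoweekday() has Sunday = 7, so map it to 0
    now = when.hour * 60 + when.minute
    for spec in specs:
        day, start, end = parse_window(spec)
        if (day is None or day == weekday) and start <= now < end:
            return True
    return False


# The two windows from the strobe example in the note.
STROBE_DOWN = ["*:12:00-14:00", "1:4:00-6:00"]
```

For example, `is_down(STROBE_DOWN, datetime(1998, 6, 15, 5, 0))` checks a Monday at 5am, which falls inside the second window.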