From: Stephen L Johnson <sjohnson@monsters.org>
Date: Thu, 18 Jun 1998 15:58:57 +0000 (+0000)
Subject: Initial import
X-Git-Tag: start~47
X-Git-Url: http://git.etc.gen.nz/cgi-bin/gitweb.cgi?a=commitdiff_plain;h=0d43ba37273909ab528e97ce040879521eb95d71;p=spong.git

Initial import
---

diff --git a/changes-2.1 b/changes-2.1
new file mode 100755
index 0000000..b4ee214
--- /dev/null
+++ b/changes-2.1
@@ -0,0 +1,375 @@
+
+I'm sure the Gold App Dev Request form for all of this is just delayed
+in Campus Mail...
+
+This email note is an epic.  Ed in the role of the protagonist, Doug in
+the role of the antagonist (of course).  Save this note, it's the only
+documentation you will get for a while...
+
+
+Here is a summary of the changes that I made to spong.  The new version
+of Spong was put into production this morning.  I came down and walked
+the day shift through the new features.  I'll leave a note for the 2nd
+and 3rd shift.  The changes that I made should not significantly affect
+how they use it.
+
+But anyway - first the single, lone, isolated, only, request that I
+didn't do.
+
+> --------------------------------------------------
+> (*) Change "spong" name to "pong" or something that doesn't get confused
+> with "sponge",
+> --------------------------------------------------
+
+Changing the name would just cause additional confusion, and requires
+more effort than it is worth IMNSHO.  Feel free to make fun of me for
+being attached to the name if it makes you feel better...
+
+
+> --------------------------------------------------
+> (*) Change "procs" name to "jobs" to lessen confusion about what is down,
+> --------------------------------------------------
+
+Done; the real fix, however, will require the following:
+
+  1. You will need to push out a new copy of spong-client to all of the
+     machines and restart spong-client.
+
+  2. Once you have that done everywhere, you will need to clean out the
+     spong database on dim, so all records of the old procs service are
+     removed.
+
+But since I know how much you will whine if I try to make you do that,
+I also went ahead and added a few little hacks in the spong-scripts that
+check for "procs" in various places and replace the string with "jobs"
+when it finds it.  So you can do steps 1 and 2 above at your leisure...
+
+
+> --------------------------------------------------
+> (*) Improve flexibility of setting down times for various services,
+> --------------------------------------------------
+
+Basically, I just fixed it to work the way it was supposed to before.  In the
+spong.hosts file you can configure a host like the following:
+
+ 'strobe.weeg.uiowa.edu' =>  	{ services => 'ftp pop3 http',
+				  contact  => 'unix-staff',
+				  group    => 'unix-all',
+				  down     => [ '*:12:00-14:00',
+					        '1:4:00-6:00' ] };
+
+This definition says that strobe is down every day from noon until 2pm,
+and is also down on Monday morning from 4-6am.  You can include as many
+different downtimes as you want.  The first part of the field is the day
+of the week (0=Sunday), or * indicating that it happens every day.  The
+second part is the time range in 24-hour format.
+
+This wasn't working in the version of spong that you are using, so I
+have fixed it.  Basically during the times that you define, all of the
+services on the machine are "acknowledged" - well not actually, what
+really happens is that their services just have a "blue" state, so that
+they are not reported as problems to the operators.  I could make them
+have a unique color if you want, but I think 5 colors is plenty...
+
+Now, this isn't a full-blown cron type thing, but now that it is working I
+think it will handle most of the cases.  It doesn't handle the silicon case where
+just ftp is down at certain times, but that is addressed later...
+
+
+> --------------------------------------------------
+> (*) Include option (default?) that just shows "down" services,
+> --------------------------------------------------
+
+I believe that the main reason that you wanted to do this is just
+because it was so damn slow.  Hopefully I've fixed that problem so this
+is less of an issue.
+
+If you still want to see just the hosts with problems, basically you
+just need to jump to the URL that loads the left frame.  That URL is:
+
+	http://dim.weeg.uiowa.edu/cgi-bin/www-spong/problems/all
+
+
+> --------------------------------------------------
+> (*) Provide clarity in messages so operators page with exact phrasing,
+> --------------------------------------------------
+
+I've done the following.  There are contact links in two places: in the
+Problem List frame, where it has the name of the group/person to contact,
+and on the Host and Service page, where there is a link that says
+"Contact Staff".  Those links are now "smart", meaning that when you click
+on them, you are taken to the Operator Paging Page with the name of the
+person who is on call for that machine already selected and a message
+indicating the problem already filled out in the message box.  So in most
+cases the operator just has to press the send button.
+
+This means that there is an additional file you have to maintain with
+the paging cgi script.  There is a file called "hosts" in the directory
+with the rest of the paging stuff.  That file contains a list of hosts,
+and the group or individual that is responsible for that machine.
+
+The default message contains the host name, and if there is a single
+problem it lists the service name and summary information (which is
+sometimes redundant).  If there are multiple problems then it will just
+list the services that are red.
+
+Again, the operators will still have the ability to change the message
+if they want.
+
+
+> --------------------------------------------------
+> Improve loading performance.
+> --------------------------------------------------
+
+Well...  Here is what I've done.  I believe it's significantly faster
+from what some of you have said (but then again I thought the old one
+was just fine).
+
+Got rid of the little dot images, and replaced them with little squares
+that are just tables with their background set to different colors.
+This seems to get around the problem that Netscape has displaying all
+the images.  So once the page gets to your browser, it should display
+much faster.
+
+If some day Netscape gets fixed and you are feeling nostalgic for the
+dots, then you can turn them back on in the spong.conf file.
+
+    $WWW_USE_IMAGES = 1;
+
+Also, the exact color of the various boxes can be adjusted as well.  I
+have played with a number of variations, and I do think what I have is
+pretty good, but if you want to change it, then change the following in
+the spong.conf file
+
+    $WWW_COLOR{"red"}    = "#cc0000";
+    $WWW_COLOR{"yellow"} = "#ffff00";
+    $WWW_COLOR{"green"}  = "#339900";
+    $WWW_COLOR{"purple"} = "#990099";
+    $WWW_COLOR{"blue"}   = "#0000ff";
+
+> --------------------------------------------------
+> Allow others (e.g., Help Desk) to access spong. This will require
+> access control (e.g., passwords) as well as configurable "refresh rate"
+> option so that only operators and us are allowed auto refresh.
+> --------------------------------------------------
+
+Ok, I'll meet you half way on this one.  The access control should come
+through the standard web server mechanism.  There are 3 parts to spong:
+
+   www-spong      - which just allows you to view the spong database
+   www-spong-ack  - which allows you to update the database
+   page.cgi       - which allows the operators to page us
+
+I would suggest that you basically open up the www-spong program to the
+people that need it, and tighten down the ack and page programs so that
+you don't have the help desk acking problems or paging you.  
+
+The way that you want to set up this access is up to you (host-based vs.
+user names and passwords, etc...).  You are quite familiar with the issues
+of maintaining the password files for this type of thing.
+
+Ok, now that I've passed the buck on that, here is what I'll do.  I've
+included two new variables in the spong.conf file.  They are:
+
+      @WWW_REFRESH_ALLOW = ( '.*' );
+      @WWW_REFRESH_DENY  = ( 'edhill', '128\.255\.51\.\d+', 'traitor.*' );
+
+These lists contain regular expressions that are checked.  ALLOW
+is checked first, followed by DENY.  The following pieces of information
+are checked against these regexps - REMOTE_USER (the username if you
+protect www-spong using user authentication), REMOTE_HOST (the hostname
+of the person connecting), and REMOTE_ADDR (the IP address of the person
+connecting).
+
+If no regular expression is matched from either list, then the
+auto-refresh is not included in the output.
+
+----------------------------------------------------------------------------
+
+And over the course of trying to do these things, instead of hearing the
+kind words of encouragement that I'm used to over in this building, I
+instead heard the following additional whining and tried to address it.
+These comments have not been through Rex's tone modifier like the
+requests above.
+
+> --------------------------------------------------
+> I don't know why we use this program anyway, the spong-server seems to
+> crash for no good reason at random times.  We try to hack in things
+> like getting spong to restart spong children when they die, but if the
+> thing was just written well to begin with, we wouldn't have to do that crap.
+> --------------------------------------------------
+
+I have found a repeatable case where I could get spong to crash.  If a
+client (either updating or querying spong) would time out (because spong
+was too slow) - that client would just shut down the connection.  
+
+Well, the spong-server was still trying to write to a pipe that was now
+closed.  spong-server was told this with a SIGPIPE, but I wasn't
+handling that signal, and if you don't handle that signal and you
+get it, the program exits (that is the signal's default action).
+
+So I'm now checking for it, and it no longer seems to go away.  I've
+left your child restarting code just in case (but as you probably know -
+that won't help the case where the parent dies - either could happen
+with the case that I fixed.)
+
+There could of course be other problems than this, but there is now one
+less thing that will make spong die.
+
+
+> --------------------------------------------------
+> The history feature is useless, I can't believe it's actually slower
+> than the spong summary.  It's so slow, it never even comes back before
+> netscape times it out.  It's so slow, I went off and wrote my own
+> little spong-history command because you're a bad programmer and blah
+> blah blah...
+> --------------------------------------------------
+
+Ok, I wrote a script called spong-cleanup.  It should be run every
+night.  It does the following:
+
+   * Cleans out any history older than 7 days.  It moves the old history
+     for each host into the /local/www/docs/spong/archive directory.
+     If you don't think you would ever want to get at that history, then
+     you can just change the script so that it is deleted.
+
+   * Removes any acknowledgments that are no longer valid.
+
+   * Removes any services that don't seem to be reported any more (if
+     you stop monitoring something on a machine - the old entry will
+     still hang around and show up as purple).
+
+Cleaning out the old history brought the load time down from 90 seconds
+to about 4 seconds.  That combined with the colored squares instead of
+gif images should make retrieving the history via the web usable again.
+
+
+On another note, I actually have a command line program called "spong"
+which I have now included in the spong depot package.  This is a
+client/server program that basically reports all the same things that
+the www-spong program does (including history).  So you could install
+"spong" on your desktop and run:
+
+	spong --history
+
+or if you just want to see the Unix machines:
+
+	spong --history unix-all
+
+It also allows you to view the summary table, problem hosts, individual
+hosts, etc...  Type:
+
+	spong --help
+
+So if you wanted to do some type of automatic monitoring of the
+automatic monitoring system from your desktop, you can use the command
+line program to do it.
+
+
+> --------------------------------------------------
+> The acknowledgment mechanism sucks, we want web-ARS.  Hey, tie spong
+> into the directory server. Hey, while you're at it, build us a directory
+> server.
+> --------------------------------------------------
+
+Ok, you now have the ability to delete existing acknowledgments via the
+web interface.  When you click on a host, in the Acknowledgment section
+next to the descriptions there is a new link which allows you to delete
+an Ack.
+
+If you click on just the generic "Ack" menu item, it will take you to
+the same screen as before (allowing you to add a new acknowledgment),
+but at the top of the screen all of the pending Acks are listed, so
+you can click on them and delete them.
+
+Updating an acknowledgment can be done through this interface now.  1)
+first you delete the old one, and 2) then you add a new one that is
+similar but different than the old one - Voila, updates 8-) (I ran out
+of time...)
+
+I'll assume that the interface to all of this is self-explanatory.
+
+Ok, now to the lame solution to the "ftp is down on silicon" problem.
+As with the spong command line program, there is also a spong-ack
+client/server command line program.  In your script on silicon that
+disables ftp, you could add the following command
+
+    spong-ack silicon.weeg.uiowa.edu ftp '+18h' 'its all Taos baby'
+
+You can of course do this in any place that you have a script which is
+going to down a service for a period of time.
+
+You can also delete an acknowledgment through this command line
+interface, but it is a little more convoluted.  You would run the
+following command
+
+    spong-ack --delete silicon.weeg.uiowa.edu-ftp-898185233
+
+The little funky looking thing after the delete is the ack id.  No, you
+probably don't normally know what the ID of your acknowledgment is, but
+you can find it with the following
+
+    spong --brief --acks
+
+Yeah, deleting is not the cleanest via the command line, but you get
+what you pay for (and no, there is no "updating" via the command line,
+because in reality there is no updating period - it's all an illusion).
+
+
+> --------------------------------------------------
+> Well what else is broken with this piece of crap package...
+> --------------------------------------------------
+
+Here is one I had not heard of, but noticed this week.  If you had a
+host which had a problem with say its disk, and you acknowledged that
+service, but then the host had another problem with say jobs. The host
+would not show up in the "Problems list" on the left side of the frame,
+so I'm sure the operators would probably not call you about the second
+problem.
+
+I fixed that bug...
+
+
+> --------------------------------------------------
+> Well, I bet you screwed up all the patches that we have had to make to
+> duct tape this software together since you last worked on it a long
+> long time ago.
+> --------------------------------------------------
+
+Well, I incorporated all the changes that I could find, including the
+following:
+	
+    * The paging space addition to spong-client
+    * Dan's connection to the LDAP server for machine info
+    * Dave's various fixes for things like PID file removal, etc...
+
+Basically I incorporated everything that I could find in the History
+file or by diffing the various programs.
+
+
+> --------------------------------------------------
+> Why would you get rid of frames you dufus, it was the only thing you
+> did right.
+> --------------------------------------------------
+
+Jeeze, I was just trying to make things faster.  Relax, it's back to the
+frames version...
+
+
+> --------------------------------------------------
+> Why do you need root access on dim?  How about we just give you access
+> to the cp and ls commands. All other commands that you want to execute
+> will need to be placed in a file called "duh" and submitted for our
+> approval.
+> --------------------------------------------------
+
+My initial response:  Grrr...
+
+My response after accidentally chowning /tmp:  Yes sir, whatever is best
+sir...
+
+
+If there are any problems with any of this, I'm sure you will let me
+know.
+
+