/**************************************************************
 * Myrinet protocol module for Sun HPC ClusterTools 4.0       *
 * Copyright (c) 2001 by Myricom, Inc.                        *
 * All rights reserved.                                       *
 * Author: Karen Wang, Myricom Inc.                           *
 **************************************************************/

/* Change log for myr.c
 *
 * CT-1.2-no-mt
 *   myr.c 1.30 (1/10/2005)
 *   mpi_myr.c 1.4 (1/10/2005)
 *
 *   1. For DISABLE_MEMORY_REGISTRATION, add the env parameter
 *      GMPI_BOUNCE_BUFFERS to tune the number of intermediary
 *      buffers to use for communications.
 *
 *   2. Change defaults for large message pipeline size.
 *
 *   3. Fix a possible deadlock when the bounce buffers run out, by
 *      splitting the request queue into two FIFOs:
 *        one for send activities
 *        one for recv activities
 *      This change makes sure send and recv can progress independently.
 *
 *   4. Implement degraded mode through a bounce buffer to guarantee
 *      progress when no more locked pages are available.
 *      gmpi_use_interval returns 0 in case of error rather than a
 *      possible partial allocation.
 *
 *   5. Function ok_to_send:
 *      don't register memory if going to wait in the FIFO.
 *
 * CT-1.1-no-mt
 *   myr.c 1.29 (4/28/2004)
 *   Free alloc'ed memory that had been overlooked.
 *
 * CT-1.1-no-mt
 *   myr.c 1.28 (4/26/2004)
 *   mpi_myr.h 1.15 (4/26/2004)
 *   Fix a bug when a timeout/retransmission occurs: use
 *   gm_drop/resume_send to keep retransmitting until the send succeeds.
 *
 *   The bookkeeping counters, pending_sends and dropped_sends, are now
 *   kept on a per-NIC basis. They are indexed by the gmpi.host_index
 *   array.
 *
 * CT-1.0-no-mt
 *   myr.c 1.27 (2/24/2004)
 *   gmpi_noreg.h 1.3 (2/24/2004)
 *   gmpi_noreg.c 1.3 (2/24/2004)
 *
 *   The bounce buffers are now hashed by the bounce_buffer->addr
 *   instead of the virtual data_addr. This prevents possible corruption
 *   in bounce buffer usage (when several messages sent from the same
 *   addr are completing out of order).
 *
 * CT-0.9-no-mt
 *   myr.c 1.26 (12/01/2003)
 *   mpi_myr.h 1.14 (12/01/2003)
 *   Work around a bug in GM in send error recovery when using
 *   gm_drop/resume_send, affecting configurations with at least
 *   2 processes per NIC.
 *
 * CT-0.8-no-mt
 *   myr.c 1.25 (9/16/2003)
 *   Support both gm-2 and gm-1, using gm_unique_id to globally
 *   identify GM nodes.
 *   mpi_myr.h 1.13 (9/16/2003)
 *
 * CT-0.7-no-mt
 *   myr.c 1.24 (5/19/2003)
 *   1. Remove a rare case of deadlock when memory registration is
 *      disabled.
 *   2. Import the code from mpich-gm for memory registration -- untested.
 *      ptmalloc (supposedly a thread-safe malloc), bsdmalloc and dlmalloc.
 *   mpi_myr.h 1.11
 *
 * CT-0.6-no-mt
 *   myr.c 1.23
 *   Fix rq->r_buf with negative displacement.
 *   Use mpip_data(rq->r_buf, rq->r_type) instead of raw rq->r_buf
 *   to obtain the starting address of the user buffer passed by MPI's
 *   API functions.
 *   mpi_myr.h 1.10
 *
 *   myr.c 1.22
 *   Use correct INT_DIGIT for up_string and downstring to represent
 *   integers.
 *
 * CT-0.5-no-mt
 *   myr.c 1.21
 *   mpi_myr.h 1.10
 *   Minor change to the size of the string for RTE_[g/s]et_string.
 *
 * CT-0.5-no-mt
 *   myr.c 1.20
 *   mpi_myr.h 1.10
 *   Correct the erroneous string release with RTE_get_string().
 *   This fixed the problem at UH where namd2 could not spawn
 *   over multiple nodes (memory error).
 *
 *   myr.c 1.19
 *   mpi_myr.h 1.10
 *   Minor changes to the error message wording.
 *
 * CT-0.4-no-mt
 *   myr.c 1.18
 *   mpi_myr.h 1.10
 *
 *   1. Add more error-catching code for timed-out GM messages, using
 *      gm_drop_sends upon message time-out. Re-send all dropped messages
 *      before any new GM message is put on the wire. The original order
 *      is maintained by relying on the in-order return of all the
 *      dropped GM messages.
 *
 *   2. Back out the optimization of immediate return of MPI_Send for
 *      short messages (failed intel persist_request/MPI_Startall1 etc.)
 *
 * CT-0.3.3-no-mt
 *   myr.c 1.17
 *   mpi_myr.h 1.9
 *
 *   1. Initialization sync via RTE_set/get_string to collect/distribute
 *      the dynamic competed GM port and node info.
 *
 *   2. Allow programs to run over Myrinet PM correctly despite
 *      possibly discrepant info from RTE and /opt/SUNWhpc/conf/gm.conf.
 *      Discard GM sanity checking for the RTE-unused nodes that are
 *      listed in the conf file. This change allows more networks/systems
 *      that are recognizable by RTE but not by Myrinet PM, without
 *      modifying the conf file.
 *
 * CT-0.3.2-no-mt
 *   Initialization sync via administration GM port
 *   myr.c 1.16
 *   mpi_myr.h 1.8
 */

/* revision clustertools (HPC 4.0) 1.14
 *   Clean up error-catching code mpip_errcomm.
 *
 * revision clustertools (HPC 4.0) 1.13
 *   A complete re-write of Myrinet PM for CT 4.0
 *
 * revision clustertools (HPC 4.0) 1.12
 *   1. Back out the fix in 1.11, provide a correct multiphase rndv bug
 *      fix for 1.10 in myr_recv_match for RDATA1 msg -- malloc
 *      rq->r_data->buf_item for RDATA #1.
 *   2. Fix local initialization read/write synchronization error of 1.7.
 *
 * revision clustertools (HPC 4.0) 1.11
 *   Fix multiphase rndv bug introduced by 1.10 in myr_recv_match
 *
 * revision clustertools (HPC 4.0) 1.10
 *   Keep a list of free send DMA control buffers in MYR PM for control
 *   messages. At short gm_send, one such buffer is given to GM. At the
 *   send_done event, this buffer is put back on the list. This change
 *   improves 0-length message latency by 5~10us by avoiding the use
 *   of gm_dma_malloc/gm_dma_free each time.
 *
 * revision clustertools (HPC 4.0) 1.9
 *   Handle fast recv event immediately.
 *   Handle GM send/recv tokens in MYR PM.
 *
 * revision clustertools (HPC 4.0) 1.8
 *   Delete function myr_get_env_from_event.
 *
 * revision clustertools (HPC 4.0) 1.7
 *   Add local initialization via shmem instead of via GM msg, which
 *   requires a loopback connection and therefore a Myrinet switch even
 *   for a 2-node cluster. This change enables a simple trial cluster of
 *   two nodes connected point-to-point without a switch.
 *
 * revision clustertools (HPC 4.0) 1.6
 *   Add multithread support.
 *
 * revision clustertools (HPC 4.0) 1.5
 *   Upon receiving a cancelreq, function mpip_cancel_request is called
 *   to allocate a cancel_reply request. But 3 major fields in the
 *   cancelreply request are not assigned, i.e. cancelreply->r_peer,
 *   r_comm, r_tag (in mpi/util/cancelling.c). The outgoing CANCELREPLY
 *   therefore is not correctly recognized by the receiver (need comm,
 *   rank). Fix the field assignment right after mpip_cancel_request().
 *   Test case: probe_cancel/MPI_Cancel_isend, MPI_Cancel_issend
 *   MPI_Cancel_persist_send and MPI_Cancel_some.
 *
 * revision clustertools (HPC 4.0 Beta) 1.4
 *   When called by MPI_Finalize after an active receive request is freed
 *   by MPI_Request_free, mpip_myr_closeconns may have mpip_nactvreqs > 0
 *   which is supposed to be finished by another PM (such as shm). This
 *   is an inappropriate practice, although it is tested in
 *   intel-test-suite
 *   MPITESET/Test/c/persist_request/functional/MPI_Request_free_p
 *   (MPI_Recv_init, MPI_Start, MPI_Request_free). Using TCP PM always
 *   lets SHM PM finish the request though, maybe because TCP is slow.
 *   We make Myrinet PM robust to this situation by just checking
 *   outstanding MYR PM events -- the outstanding_sends counter.
 *
 * revision clustertools (HPC 4.0 Beta) 1.3
 * revision 1.17 2001-06-27 (continued from HPC 3.1)
 *   Clean up rq->r_data problems, make sure they are set to 0 for
 *   non-multiphase rendezvous communication.
 *   Involved functions: uirecv, myr_process_queued_env
 *
 * revision clustertools (HPC 4.0 Beta) 1.1
 * revision 1.16 2001-03-21 (continued from HPC 3.1)
 *   1. Clean up unused initialization code (initconns, makeconns,
 *      addconns) copied from tcp.c and,
 *      (1). myr_parse_configstring,
 *      (2). myr_init (previously myr_gmpi_init),
 *      (3). myr_resource_alloc (previously myr_gmpi_subinit_snd_rcv) and,
 *      (4). myr_resource_free (previously myr_gmpi_free),
 *      (5). myr_get_remoteports (previously myr_getports),
 *      (6). add myr_getenv for more env settings for tuning myr PM
 *   2. In frecv DATA handling, set
 *      bb1->sndreq = (MPI_Request) env.ev_sndreq;
 *      previously it was bb1->sndreq = req. This may be the reason a
 *      send req->r_state was set to recvtail in _myr_sent
 *      (GM_HIGH_PRIORITY).
 *   3. Change the location of the frecv_unexpected code so that it is
 *      properly used under 2 conditions.
 *   4. Use env MPI_MYR_CONF instead of GMPI_CONF for myr PM's config file
 *
 * revision 1.15 2001-02-28
 *   Fixed a bug where mpip_finipackfly was called with illegal
 *   parameters. In myr_send_from_queue SENDACK,
 *   gm_send_context->packer should be set to rq->r_data->packer for
 *   rndv recv.
 *
 * revision 1.14 2001-02-22
 *   Modified for 16 ports per Myrinet card support.
 *   Each board has 12 ports for mpi now. /opt/SUNWhpc/conf/gm.conf needs
 *   to be modified accordingly.
 *
 * revision 1.13 2000-10-02
 *   1. clean up size_t etc. ILP32 LP64 confusion
 *   2. take care of _GM_SLEEP_EVENT returned by gm_receive(), which
 *      previously caused the program to halt (_GM_SLEEP_EVENT was
 *      handled by calling gm_unknown, which put gm to sleep).
 *
 * revision 1.12 2000-08-29
 *   1. back out broken req_set/test_recvtail
 *
 * revision 1.11 2000-08-10
 *   1. fix a bug in myr_recv which leads to a received msg not being
 *      processed
 *   2. fixed req_set/test_recvtail
 *   3. clean up printf statements
 *
 * revision 1.10 2000-08-08
 *   fixed adm_gm_port sent_done event and port close
 *
 * revision 1.9 : 2000-08-07
 *   1. Clean up function mpip_myr_frecv's code
 *   2. Clean up memory allocation and de-allocation (except for
 *      myr_parse_configstring)
 *
 * revision 1.8 : 2000-07-26
 *   1. fixed recv_match wrong state at receiving ACK.
 *      After a rts or a tail msg is sent out, the state of the send
 *      request is set_recvack in _myr_sent. Since there is a chance
 *      that the awaited ACK arrives before the sent_done event is
 *      received, we hold the received ACK in a queue and don't process
 *      it until the state has become recvack.
 *   2. fixed fsend multiphase rendezvous send, recv site uses MPI_Recv
 *      (recv_match)
 *
 * revision : 2000-07-12
 *   fixed infinite loop with multiple gm ports due to erroneous blocking
 *   progress engine
 *
 * revision : 2000-06-26
 *   fixed infinite loop due to erroneous protocol (myr/shm) used with
 *   multiple gm ports
 *
 * revision : 2000-06-06
 *   take care of truncated messages (actual send msg longer than the
 *   posted recv msg)
 *
 * revision HPC_3_1__0_1: 2000-06-01
 *   first working version
 */